Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.
The fundamental Pandas data structures:
import pandas as pd
import numpy as np
# create a numpy array
a = np.array([4, 8, 3, 20])
a
array([ 4, 8, 3, 20])
# create a pandas series
b = pd.Series([4, 8, 3, 20])
b
0 4 1 8 2 3 3 20 dtype: int64
b.values # this is a numpy array just like a
array([ 4, 8, 3, 20], dtype=int64)
b.index
RangeIndex(start=0, stop=4, step=1)
b[3] # slicing just like numpy
20
# string index
x = pd.Series([4, 8, 3, 20], index=['apple', 'banana', 'cherry', 'kiwi'])
x
apple 4 banana 8 cherry 3 kiwi 20 dtype: int64
# x[2] and x['cherry'] access the same element
print(x[2])
print(x['cherry'])
3 3
A pandas series with string indexes are like a Python dictionary.
# This is a python dictionary
population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population_dict['New York']
19651127
# Create a pandas sieries from a dictionary
population = pd.Series(population_dict)
population
California 38332521 Texas 26448193 New York 19651127 Florida 19552860 Illinois 12882135 dtype: int64
population['Florida']
19552860
population['New York':'Illinois'] # you cannot do this kind of slicing with a Python dict
New York 19651127 Florida 19552860 Illinois 12882135 dtype: int64
population[:'Illinois']
California 38332521 Texas 26448193 New York 19651127 Florida 19552860 Illinois 12882135 dtype: int64
The following figure defines different components of a dataframe.
# how to display an image
from IPython.display import Image
Image('../img/pandas-dataframe.png')
# create another pandas series
area_dict = {'California': 423967,
'Texas': 695662,
'New York': 141297,
'Florida': 170312,
'Illinois': 149995}
area = pd.Series(area_dict)
area
California 423967 Texas 695662 New York 141297 Florida 170312 Illinois 149995 dtype: int64
# combine two series to form a dataframe
states = pd.DataFrame({'population': population,
'area': area})
states
| population | area | |
|---|---|---|
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
| Florida | 19552860 | 170312 |
| Illinois | 12882135 | 149995 |
print(states.index)
print(states.columns)
Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object') Index(['population', 'area'], dtype='object')
# create a dataframe from a two dimensional array with implicit index (not specifying indexes)
# you can see the indexes are implicit integers
basket = pd.DataFrame([['apple', 2, 1], ['orange', 5, 1.5], ['kiwi', 4, 2], ['grape', 3, 3], ['cherry', 25, 3.5]])
basket
| 0 | 1 | 2 | |
|---|---|---|---|
| 0 | apple | 2 | 1.0 |
| 1 | orange | 5 | 1.5 |
| 2 | kiwi | 4 | 2.0 |
| 3 | grape | 3 | 3.0 |
| 4 | cherry | 25 | 3.5 |
# the most common way, we have a dataframe is with column names (explicit column indexes)
# the row index here is still integer-based implicit index
basket = pd.DataFrame([['apple', 2, 1], ['orange', 5, 1.5], ['kiwi', 4, 2], ['grape', 3, 3], ['cherry', 25, 3.5]],
columns=['item', 'quantity', 'price'])
basket
| item | quantity | price | |
|---|---|---|---|
| 0 | apple | 2 | 1.0 |
| 1 | orange | 5 | 1.5 |
| 2 | kiwi | 4 | 2.0 |
| 3 | grape | 3 | 3.0 |
| 4 | cherry | 25 | 3.5 |
iloc[] integer-location based indexing for selection by position. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
The syntax is like iloc[row, column] - starting from 0
# one row
basket.iloc[2] # the third row, same as basket.iloc[2, :]
item kiwi quantity 4 price 2.0 Name: 2, dtype: object
# one column
basket.iloc[:, 0]
0 apple 1 orange 2 kiwi 3 grape 4 cherry Name: item, dtype: object
basket.iloc[0:3, 0]
0 apple 1 orange 2 kiwi Name: item, dtype: object
# multi rows, all columns
basket.iloc[2:4, :]
| item | quantity | price | |
|---|---|---|---|
| 2 | kiwi | 4 | 2.0 |
| 3 | grape | 3 | 3.0 |
basket[2:4] # same as a shorthand
| item | quantity | price | |
|---|---|---|---|
| 2 | kiwi | 4 | 2.0 |
| 3 | grape | 3 | 3.0 |
# multi rows and column
basket.iloc[2:4, 0:2]
| item | quantity | |
|---|---|---|
| 2 | kiwi | 4 |
| 3 | grape | 3 |
loc[]: Access a group of rows and columns by label(s) or a boolean array. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
The syntax is like loc[row, column]
# one row
# note that 1 is considered a label not the integer index - although in this case they are the same
basket.loc[1]
item orange quantity 5 price 1.5 Name: 1, dtype: object
# one column
basket.loc[:, ['item']]
# basket.loc[:, 'item'] # same
| item | |
|---|---|
| 0 | apple |
| 1 | orange |
| 2 | kiwi |
| 3 | grape |
| 4 | cherry |
basket['item'] # same - shorthand
0 apple 1 orange 2 kiwi 3 grape 4 cherry Name: item, dtype: object
# multiple columns
basket.loc[:, ['item', 'price']]
| item | price | |
|---|---|---|
| 0 | apple | 1.0 |
| 1 | orange | 1.5 |
| 2 | kiwi | 2.0 |
| 3 | grape | 3.0 |
| 4 | cherry | 3.5 |
basket[['item', 'price']] # same - shorthand
| item | price | |
|---|---|---|
| 0 | apple | 1.0 |
| 1 | orange | 1.5 |
| 2 | kiwi | 2.0 |
| 3 | grape | 3.0 |
| 4 | cherry | 3.5 |
# multi rows and columns by labels
basket.loc[1:3, ['item', 'price']] # 1, 3 are still treated as labels
#basket.loc[[1, 3], ['item', 'price']] # different, think about the result
| item | price | |
|---|---|---|
| 1 | orange | 1.5 |
| 2 | kiwi | 2.0 |
| 3 | grape | 3.0 |
basket.loc[[1, 3], ['item', 'price']]
| item | price | |
|---|---|---|
| 1 | orange | 1.5 |
| 3 | grape | 3.0 |
# select multiple columns
basket[['item', 'quantity']]
| item | quantity | |
|---|---|---|
| 0 | apple | 2 |
| 1 | orange | 5 |
| 2 | kiwi | 4 |
| 3 | grape | 3 |
| 4 | cherry | 25 |
basket.iloc[[0,2,4], ]
| item | quantity | price | |
|---|---|---|---|
| 0 | apple | 2 | 1.0 |
| 2 | kiwi | 4 | 2.0 |
| 4 | cherry | 25 | 3.5 |
basket.iloc[1:4]
| item | quantity | price | |
|---|---|---|---|
| 1 | orange | 5 | 1.5 |
| 2 | kiwi | 4 | 2.0 |
| 3 | grape | 3 | 3.0 |
basket.loc[1:3, :]
| item | quantity | price | |
|---|---|---|---|
| 1 | orange | 5 | 1.5 |
| 2 | kiwi | 4 | 2.0 |
| 3 | grape | 3 | 3.0 |
#Get the rows: index 1~3; columns: item, price
basket.iloc[1:4, [0,2]]
| item | price | |
|---|---|---|
| 1 | orange | 1.5 |
| 2 | kiwi | 2.0 |
| 3 | grape | 3.0 |
basket.loc[1:3, ['item','price']]
| item | price | |
|---|---|---|
| 1 | orange | 1.5 |
| 2 | kiwi | 2.0 |
| 3 | grape | 3.0 |
# load dataset from csv with no headers
# by default the first row is considered the header
# the following dataframe would be wrong!!!
df = pd.read_csv('../data/customer-churn-example-simple-noheader.csv')
df.head(3) # show the first three rows
| KS | 128 | no | 265.1 | 110 | 1 | FALSE | |
|---|---|---|---|---|---|---|---|
| 0 | OH | 107 | no | 161.6 | 123 | 1 | False |
| 1 | NJ | 137 | no | 243.4 | 114 | 0 | False |
| 2 | OH | 84 | yes | 299.4 | 71 | 2 | False |
df.head()
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | no | 265.1 | 110 | 1 | False |
| 1 | OH | 107 | no | 161.6 | 123 | 1 | False |
| 2 | NJ | 137 | no | 243.4 | 114 | 0 | False |
| 3 | OH | 84 | yes | 299.4 | 71 | 2 | False |
| 4 | OK | 75 | yes | 166.7 | 113 | 3 | False |
# load dataset from csv with no headers
# tell pandas there is no header
df = pd.read_csv('../data/customer-churn-example-simple-noheader.csv', header=None)
df.head(3) # show the first three rows
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | no | 265.1 | 110 | 1 | False |
| 1 | OH | 107 | no | 161.6 | 123 | 1 | False |
| 2 | NJ | 137 | no | 243.4 | 114 | 0 | False |
# load dataset from csv with no headers
# add header manually - simplified column names
df = pd.read_csv('../data/customer-churn-example-simple-noheader.csv',
names=['state',
'acct len',
'inter plan',
'total day min',
'total day calls',
'service calls',
'churn'
])
df.head(3) # show the first three rows
| state | acct len | inter plan | total day min | total day calls | service calls | churn | |
|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | no | 265.1 | 110 | 1 | False |
| 1 | OH | 107 | no | 161.6 | 123 | 1 | False |
| 2 | NJ | 137 | no | 243.4 | 114 | 0 | False |
df.columns = names=['state',
'acct len',
'inter plan',
'total day min',
'total day calls',
'service calls',
'churn'
]
df
| state | acct len | inter plan | total day min | total day calls | service calls | churn | |
|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | no | 265.1 | 110 | 1 | False |
| 1 | OH | 107 | no | 161.6 | 123 | 1 | False |
| 2 | NJ | 137 | no | 243.4 | 114 | 0 | False |
| 3 | OH | 84 | yes | 299.4 | 71 | 2 | False |
| 4 | OK | 75 | yes | 166.7 | 113 | 3 | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3328 | AZ | 192 | no | 156.2 | 77 | 2 | False |
| 3329 | WV | 68 | no | 231.1 | 57 | 3 | False |
| 3330 | RI | 28 | no | 180.8 | 109 | 2 | False |
| 3331 | CT | 184 | yes | 213.8 | 105 | 2 | False |
| 3332 | TN | 74 | no | 234.4 | 113 | 0 | False |
3333 rows × 7 columns
# load dataset from csv with headers
# use a different csv with the header
df = pd.read_csv('../data/customer-churn-example-simple.csv')
df.head(3) # show the first three rows
| state | account length | international plan | total day minutes | total day calls | customer service calls | churn | |
|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | no | 265.1 | 110 | 1 | False |
| 1 | OH | 107 | no | 161.6 | 123 | 1 | False |
| 2 | NJ | 137 | no | 243.4 | 114 | 0 | False |
# shape of the df: rows and columns
df.shape
(3333, 7)
df.shape[0]
7
df.shape[1]
7
# the data types of each column
df.dtypes
state object account length int64 international plan object total day minutes float64 total day calls int64 customer service calls int64 churn object dtype: object
# change churn to string from boolean for histogram
df['churn'] = df['churn'].apply(str)
df.dtypes
state object account length int64 international plan object total day minutes float64 total day calls int64 customer service calls int64 churn object dtype: object
ax = df.churn.hist()
# change account length from int to float
# not necesary - just for demostration purpose
df['account length'] = df['account length'].apply(float)
df.dtypes
state object account length float64 international plan object total day minutes float64 total day calls int64 customer service calls int64 churn object dtype: object
# combine the commands above in one
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3333 entries, 0 to 3332 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 state 3333 non-null object 1 account length 3333 non-null float64 2 international plan 3333 non-null object 3 total day minutes 3333 non-null float64 4 total day calls 3333 non-null int64 5 customer service calls 3333 non-null int64 6 churn 3333 non-null object dtypes: float64(2), int64(2), object(3) memory usage: 182.4+ KB
# descriptive stats for numerical columns
df.describe()
| account length | total day minutes | total day calls | customer service calls | |
|---|---|---|---|---|
| count | 3333.000000 | 3333.000000 | 3333.000000 | 3333.000000 |
| mean | 101.064806 | 179.775098 | 100.435644 | 1.562856 |
| std | 39.822106 | 54.467389 | 20.069084 | 1.315491 |
| min | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 74.000000 | 143.700000 | 87.000000 | 1.000000 |
| 50% | 101.000000 | 179.400000 | 101.000000 | 1.000000 |
| 75% | 127.000000 | 216.400000 | 114.000000 | 2.000000 |
| max | 243.000000 | 350.800000 | 165.000000 | 9.000000 |
# descriptive stats for numerical and categorical columns
df.describe(include='all')
| state | account length | international plan | total day minutes | total day calls | customer service calls | churn | |
|---|---|---|---|---|---|---|---|
| count | 3333 | 3333.000000 | 3333 | 3333.000000 | 3333.000000 | 3333.000000 | 3333 |
| unique | 51 | NaN | 2 | NaN | NaN | NaN | 2 |
| top | WV | NaN | no | NaN | NaN | NaN | False |
| freq | 106 | NaN | 3010 | NaN | NaN | NaN | 2850 |
| mean | NaN | 101.064806 | NaN | 179.775098 | 100.435644 | 1.562856 | NaN |
| std | NaN | 39.822106 | NaN | 54.467389 | 20.069084 | 1.315491 | NaN |
| min | NaN | 1.000000 | NaN | 0.000000 | 0.000000 | 0.000000 | NaN |
| 25% | NaN | 74.000000 | NaN | 143.700000 | 87.000000 | 1.000000 | NaN |
| 50% | NaN | 101.000000 | NaN | 179.400000 | 101.000000 | 1.000000 | NaN |
| 75% | NaN | 127.000000 | NaN | 216.400000 | 114.000000 | 2.000000 | NaN |
| max | NaN | 243.000000 | NaN | 350.800000 | 165.000000 | 9.000000 | NaN |
# you can do calculation on features
df['avg call minutes'] = df['total day minutes']/df['total day calls']
df.head()
| state | account length | international plan | total day minutes | total day calls | customer service calls | churn | avg call minutes | |
|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128.0 | no | 265.1 | 110 | 1 | False | 2.410000 |
| 1 | OH | 107.0 | no | 161.6 | 123 | 1 | False | 1.313821 |
| 2 | NJ | 137.0 | no | 243.4 | 114 | 0 | False | 2.135088 |
| 3 | OH | 84.0 | yes | 299.4 | 71 | 2 | False | 4.216901 |
| 4 | OK | 75.0 | yes | 166.7 | 113 | 3 | False | 1.475221 |
# count unique values of categorical variable
churn_count = df['churn'].value_counts() # return a pandas series with each category as an index
print(churn_count)
print(churn_count.index)
print(churn_count.values)
print(f'Churn rate = {churn_count.values[1]/sum(churn_count):.2%}') # f string formatting number: {variable:.2f} means two decimals
False 2850 True 483 Name: churn, dtype: int64 Index(['False', 'True'], dtype='object') [2850 483] Churn rate = 14.49%
churn_count.plot(kind='bar')
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) ~\AppData\Local\Temp/ipykernel_14436/3219840281.py in <module> ----> 1 df['churn'].count.value.plot(kind='bar') AttributeError: 'function' object has no attribute 'value'
Here are some execerises for you:
# Retrieve the first 10 rows in df:
df.head(10)
| state | account length | international plan | total day minutes | total day calls | customer service calls | churn | avg call minutes | |
|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128.0 | no | 265.1 | 110 | 1 | False | 2.410000 |
| 1 | OH | 107.0 | no | 161.6 | 123 | 1 | False | 1.313821 |
| 2 | NJ | 137.0 | no | 243.4 | 114 | 0 | False | 2.135088 |
| 3 | OH | 84.0 | yes | 299.4 | 71 | 2 | False | 4.216901 |
| 4 | OK | 75.0 | yes | 166.7 | 113 | 3 | False | 1.475221 |
| 5 | AL | 118.0 | yes | 223.4 | 98 | 0 | False | 2.279592 |
| 6 | MA | 121.0 | no | 218.2 | 88 | 3 | False | 2.479545 |
| 7 | MO | 147.0 | yes | 157.0 | 79 | 0 | False | 1.987342 |
| 8 | LA | 117.0 | no | 184.5 | 97 | 1 | False | 1.902062 |
| 9 | WV | 141.0 | yes | 258.6 | 84 | 0 | False | 3.078571 |
# Retrieve the 10th rows in df:
df.iloc[9]
state WV account length 141.0 international plan yes total day minutes 258.6 total day calls 84 customer service calls 0 churn False avg call minutes 3.078571 Name: 9, dtype: object
# Retrieve the first 3 columns in df:
df.iloc[:,0:3]
| state | account length | international plan | |
|---|---|---|---|
| 0 | KS | 128.0 | no |
| 1 | OH | 107.0 | no |
| 2 | NJ | 137.0 | no |
| 3 | OH | 84.0 | yes |
| 4 | OK | 75.0 | yes |
| ... | ... | ... | ... |
| 3328 | AZ | 192.0 | no |
| 3329 | WV | 68.0 | no |
| 3330 | RI | 28.0 | no |
| 3331 | CT | 184.0 | yes |
| 3332 | TN | 74.0 | no |
3333 rows × 3 columns
# Retrieve the 3rd column in df:
df[['international plan']]
| international plan | |
|---|---|
| 0 | no |
| 1 | no |
| 2 | no |
| 3 | yes |
| 4 | yes |
| ... | ... |
| 3328 | no |
| 3329 | no |
| 3330 | no |
| 3331 | yes |
| 3332 | no |
3333 rows × 1 columns
# Retrieve all rows in the two columsn 'total day minutes' and 'total day calls'
df.loc[:,['total day minutes','total day calls']]
| total day minutes | total day calls | |
|---|---|---|
| 0 | 265.1 | 110 |
| 1 | 161.6 | 123 |
| 2 | 243.4 | 114 |
| 3 | 299.4 | 71 |
| 4 | 166.7 | 113 |
| ... | ... | ... |
| 3328 | 156.2 | 77 |
| 3329 | 231.1 | 57 |
| 3330 | 180.8 | 109 |
| 3331 | 213.8 | 105 |
| 3332 | 234.4 | 113 |
3333 rows × 2 columns
# Retrieve the frist 10 rows of the two columns 'total day minutes' and 'total day calls'
df.loc[0:9,['total day minutes','total day calls']]
| total day minutes | total day calls | |
|---|---|---|
| 0 | 265.1 | 110 |
| 1 | 161.6 | 123 |
| 2 | 243.4 | 114 |
| 3 | 299.4 | 71 |
| 4 | 166.7 | 113 |
| 5 | 223.4 | 98 |
| 6 | 218.2 | 88 |
| 7 | 157.0 | 79 |
| 8 | 184.5 | 97 |
| 9 | 258.6 | 84 |
# Retrieve the eight rows from the 11th to the 18th of the two columns 'total day minutes' and 'total day calls'
df.loc[10:17,['total day minutes','total day calls']]
| total day minutes | total day calls | |
|---|---|---|
| 10 | 129.1 | 137 |
| 11 | 187.7 | 127 |
| 12 | 128.8 | 96 |
| 13 | 156.6 | 88 |
| 14 | 120.7 | 70 |
| 15 | 332.9 | 67 |
| 16 | 196.4 | 139 |
| 17 | 190.7 | 114 |
Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.
The fundamental Pandas data structures:
When you get a dataset to analyze, it is rare that the data set is clean or in exactly the right form you need. Often you’ll need to perform some data preprocessing/wrangling, e.g., creating some new variables or summaries, filtering out some rows based on certain search criteria, renaming the variables, reordering the observations by some column, etc.
In this notebook, you will learn how to perform a variety of data preprocessing tasks. Here, we will use a dataset on flights departing New York City in 2013.
import pandas as pd
import numpy as np
# Install the package 'nycflights13' before you can run this
from nycflights13 import flights
flights.head()
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
flights.shape
(336776, 19)
list(flights.columns)
['year', 'month', 'day', 'dep_time', 'sched_dep_time', 'dep_delay', 'arr_time', 'sched_arr_time', 'arr_delay', 'carrier', 'flight', 'tailnum', 'origin', 'dest', 'air_time', 'distance', 'hour', 'minute', 'time_hour']
Date of departure
Actual departure and arrival times (format HHMM or HMM), local tz.
Scheduled departure and arrival times (format HHMM or HMM), local tz.
Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
Time of scheduled departure broken into hour and minutes.
Two letter carrier abbreviation. See airlines() to get name
Plane tail number
Flight number
Origin and destination. See airports() for additional metadata.
Amount of time spent in the air, in minutes
Distance between airports, in miles
Scheduled date and hour of the flight as a date. Along with origin, can be used to join flights data to weather data.
flights.dtypes
year int64 month int64 day int64 dep_time float64 sched_dep_time int64 dep_delay float64 arr_time float64 sched_arr_time int64 arr_delay float64 carrier object flight int64 tailnum object origin object dest object air_time float64 distance int64 hour int64 minute int64 time_hour object dtype: object
flights.describe(include='all')
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 336776.0 | 336776.000000 | 336776.000000 | 328521.000000 | 336776.000000 | 328521.000000 | 328063.000000 | 336776.000000 | 327346.000000 | 336776 | 336776.000000 | 334264 | 336776 | 336776 | 327346.000000 | 336776.000000 | 336776.000000 | 336776.000000 | 336776 |
| unique | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 16 | NaN | 4043 | 3 | 105 | NaN | NaN | NaN | NaN | 6936 |
| top | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | UA | NaN | N725MQ | EWR | ORD | NaN | NaN | NaN | NaN | 2013-09-13T12:00:00Z |
| freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 58665 | NaN | 575 | 120835 | 17283 | NaN | NaN | NaN | NaN | 94 |
| mean | 2013.0 | 6.548510 | 15.710787 | 1349.109947 | 1344.254840 | 12.639070 | 1502.054999 | 1536.380220 | 6.895377 | NaN | 1971.923620 | NaN | NaN | NaN | 150.686460 | 1039.912604 | 13.180247 | 26.230100 | NaN |
| std | 0.0 | 3.414457 | 8.768607 | 488.281791 | 467.335756 | 40.210061 | 533.264132 | 497.457142 | 44.633292 | NaN | 1632.471938 | NaN | NaN | NaN | 93.688305 | 733.233033 | 4.661316 | 19.300846 | NaN |
| min | 2013.0 | 1.000000 | 1.000000 | 1.000000 | 106.000000 | -43.000000 | 1.000000 | 1.000000 | -86.000000 | NaN | 1.000000 | NaN | NaN | NaN | 20.000000 | 17.000000 | 1.000000 | 0.000000 | NaN |
| 25% | 2013.0 | 4.000000 | 8.000000 | 907.000000 | 906.000000 | -5.000000 | 1104.000000 | 1124.000000 | -17.000000 | NaN | 553.000000 | NaN | NaN | NaN | 82.000000 | 502.000000 | 9.000000 | 8.000000 | NaN |
| 50% | 2013.0 | 7.000000 | 16.000000 | 1401.000000 | 1359.000000 | -2.000000 | 1535.000000 | 1556.000000 | -5.000000 | NaN | 1496.000000 | NaN | NaN | NaN | 129.000000 | 872.000000 | 13.000000 | 29.000000 | NaN |
| 75% | 2013.0 | 10.000000 | 23.000000 | 1744.000000 | 1729.000000 | 11.000000 | 1940.000000 | 1945.000000 | 14.000000 | NaN | 3465.000000 | NaN | NaN | NaN | 192.000000 | 1389.000000 | 17.000000 | 44.000000 | NaN |
| max | 2013.0 | 12.000000 | 31.000000 | 2400.000000 | 2359.000000 | 1301.000000 | 2400.000000 | 2359.000000 | 1272.000000 | NaN | 8500.000000 | NaN | NaN | NaN | 695.000000 | 4983.000000 | 23.000000 | 59.000000 | NaN |
You will learn the five key operations that allow you to solve the vast majority of your data manipulation challenges:
These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.
# Filter rows
# Select all flights in January:
flights.loc[flights['month']==1]
# flights.loc[flights.month==1]
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26999 | 2013 | 1 | 31 | NaN | 1325 | NaN | NaN | 1505 | NaN | MQ | 4475 | N730MQ | LGA | RDU | NaN | 431 | 13 | 25 | 2013-01-31T18:00:00Z |
| 27000 | 2013 | 1 | 31 | NaN | 1200 | NaN | NaN | 1430 | NaN | MQ | 4658 | N505MQ | LGA | ATL | NaN | 762 | 12 | 0 | 2013-01-31T17:00:00Z |
| 27001 | 2013 | 1 | 31 | NaN | 1410 | NaN | NaN | 1555 | NaN | MQ | 4491 | N734MQ | LGA | CLE | NaN | 419 | 14 | 10 | 2013-01-31T19:00:00Z |
| 27002 | 2013 | 1 | 31 | NaN | 1446 | NaN | NaN | 1757 | NaN | UA | 337 | NaN | LGA | IAH | NaN | 1416 | 14 | 46 | 2013-01-31T19:00:00Z |
| 27003 | 2013 | 1 | 31 | NaN | 625 | NaN | NaN | 934 | NaN | UA | 1497 | NaN | LGA | IAH | NaN | 1416 | 6 | 25 | 2013-01-31T11:00:00Z |
27004 rows × 19 columns
flights[flights['month']==1]
#flights[flights.month==1]
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26999 | 2013 | 1 | 31 | NaN | 1325 | NaN | NaN | 1505 | NaN | MQ | 4475 | N730MQ | LGA | RDU | NaN | 431 | 13 | 25 | 2013-01-31T18:00:00Z |
| 27000 | 2013 | 1 | 31 | NaN | 1200 | NaN | NaN | 1430 | NaN | MQ | 4658 | N505MQ | LGA | ATL | NaN | 762 | 12 | 0 | 2013-01-31T17:00:00Z |
| 27001 | 2013 | 1 | 31 | NaN | 1410 | NaN | NaN | 1555 | NaN | MQ | 4491 | N734MQ | LGA | CLE | NaN | 419 | 14 | 10 | 2013-01-31T19:00:00Z |
| 27002 | 2013 | 1 | 31 | NaN | 1446 | NaN | NaN | 1757 | NaN | UA | 337 | NaN | LGA | IAH | NaN | 1416 | 14 | 46 | 2013-01-31T19:00:00Z |
| 27003 | 2013 | 1 | 31 | NaN | 625 | NaN | NaN | 934 | NaN | UA | 1497 | NaN | LGA | IAH | NaN | 1416 | 6 | 25 | 2013-01-31T11:00:00Z |
27004 rows × 19 columns
# Select all flights on January 1st:
flights[(flights.month==1) & (flights.day==1)]
#flights[(flights['month']==1) & (flights['day']==1)]
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 837 | 2013 | 1 | 1 | 2356.0 | 2359 | -3.0 | 425.0 | 437 | -12.0 | B6 | 727 | N588JB | JFK | BQN | 186.0 | 1576 | 23 | 59 | 2013-01-02T04:00:00Z |
| 838 | 2013 | 1 | 1 | NaN | 1630 | NaN | NaN | 1815 | NaN | EV | 4308 | N18120 | EWR | RDU | NaN | 416 | 16 | 30 | 2013-01-01T21:00:00Z |
| 839 | 2013 | 1 | 1 | NaN | 1935 | NaN | NaN | 2240 | NaN | AA | 791 | N3EHAA | LGA | DFW | NaN | 1389 | 19 | 35 | 2013-01-02T00:00:00Z |
| 840 | 2013 | 1 | 1 | NaN | 1500 | NaN | NaN | 1825 | NaN | AA | 1925 | N3EVAA | LGA | MIA | NaN | 1096 | 15 | 0 | 2013-01-01T20:00:00Z |
| 841 | 2013 | 1 | 1 | NaN | 600 | NaN | NaN | 901 | NaN | B6 | 125 | N618JB | JFK | FLL | NaN | 1069 | 6 | 0 | 2013-01-01T11:00:00Z |
842 rows × 19 columns
# Save the subset to a new dataframe
flights_0101 = flights[(flights.month==1) & (flights.day==1)]
flights_0101
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 837 | 2013 | 1 | 1 | 2356.0 | 2359 | -3.0 | 425.0 | 437 | -12.0 | B6 | 727 | N588JB | JFK | BQN | 186.0 | 1576 | 23 | 59 | 2013-01-02T04:00:00Z |
| 838 | 2013 | 1 | 1 | NaN | 1630 | NaN | NaN | 1815 | NaN | EV | 4308 | N18120 | EWR | RDU | NaN | 416 | 16 | 30 | 2013-01-01T21:00:00Z |
| 839 | 2013 | 1 | 1 | NaN | 1935 | NaN | NaN | 2240 | NaN | AA | 791 | N3EHAA | LGA | DFW | NaN | 1389 | 19 | 35 | 2013-01-02T00:00:00Z |
| 840 | 2013 | 1 | 1 | NaN | 1500 | NaN | NaN | 1825 | NaN | AA | 1925 | N3EVAA | LGA | MIA | NaN | 1096 | 15 | 0 | 2013-01-01T20:00:00Z |
| 841 | 2013 | 1 | 1 | NaN | 600 | NaN | NaN | 901 | NaN | B6 | 125 | N618JB | JFK | FLL | NaN | 1069 | 6 | 0 | 2013-01-01T11:00:00Z |
842 rows × 19 columns
# Select all flights scheduled to depart before 6:00 am.
flights[flights.sched_dep_time<=600]
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335808 | 2013 | 9 | 30 | 601.0 | 600 | 1.0 | 839.0 | 905 | -26.0 | AA | 1175 | N3FEAA | LGA | MIA | 140.0 | 1096 | 6 | 0 | 2013-09-30T10:00:00Z |
| 335810 | 2013 | 9 | 30 | 603.0 | 600 | 3.0 | 705.0 | 730 | -25.0 | UA | 279 | N457UA | EWR | ORD | 103.0 | 719 | 6 | 0 | 2013-09-30T10:00:00Z |
| 335814 | 2013 | 9 | 30 | 609.0 | 600 | 9.0 | 834.0 | 815 | 19.0 | FL | 345 | N261AT | LGA | ATL | 111.0 | 762 | 6 | 0 | 2013-09-30T10:00:00Z |
| 335842 | 2013 | 9 | 30 | 632.0 | 600 | 32.0 | 734.0 | 701 | 33.0 | US | 2134 | N748UW | LGA | BOS | 35.0 | 184 | 6 | 0 | 2013-09-30T10:00:00Z |
| 335896 | 2013 | 9 | 30 | 724.0 | 600 | 84.0 | 946.0 | 840 | 66.0 | B6 | 27 | N558JB | EWR | MCO | 126.0 | 937 | 6 | 0 | 2013-09-30T10:00:00Z |
8970 rows × 19 columns
# Use query() function
flights.query('sched_dep_time<=600')
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335808 | 2013 | 9 | 30 | 601.0 | 600 | 1.0 | 839.0 | 905 | -26.0 | AA | 1175 | N3FEAA | LGA | MIA | 140.0 | 1096 | 6 | 0 | 2013-09-30T10:00:00Z |
| 335810 | 2013 | 9 | 30 | 603.0 | 600 | 3.0 | 705.0 | 730 | -25.0 | UA | 279 | N457UA | EWR | ORD | 103.0 | 719 | 6 | 0 | 2013-09-30T10:00:00Z |
| 335814 | 2013 | 9 | 30 | 609.0 | 600 | 9.0 | 834.0 | 815 | 19.0 | FL | 345 | N261AT | LGA | ATL | 111.0 | 762 | 6 | 0 | 2013-09-30T10:00:00Z |
| 335842 | 2013 | 9 | 30 | 632.0 | 600 | 32.0 | 734.0 | 701 | 33.0 | US | 2134 | N748UW | LGA | BOS | 35.0 | 184 | 6 | 0 | 2013-09-30T10:00:00Z |
| 335896 | 2013 | 9 | 30 | 724.0 | 600 | 84.0 | 946.0 | 840 | 66.0 | B6 | 27 | N558JB | EWR | MCO | 126.0 | 937 | 6 | 0 | 2013-09-30T10:00:00Z |
8970 rows × 19 columns
flights.query('month==1 and day==2 and sched_dep_time<=600')
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 844 | 2013 | 1 | 2 | 458.0 | 500 | -2.0 | 703.0 | 650 | 13.0 | US | 1030 | N162UW | EWR | CLT | 108.0 | 529 | 5 | 0 | 2013-01-02T10:00:00Z |
| 845 | 2013 | 1 | 2 | 512.0 | 515 | -3.0 | 809.0 | 819 | -10.0 | UA | 1453 | N76515 | EWR | IAH | 214.0 | 1400 | 5 | 15 | 2013-01-02T10:00:00Z |
| 846 | 2013 | 1 | 2 | 535.0 | 540 | -5.0 | 831.0 | 850 | -19.0 | AA | 1141 | N621AA | JFK | MIA | 156.0 | 1089 | 5 | 40 | 2013-01-02T10:00:00Z |
| 847 | 2013 | 1 | 2 | 536.0 | 529 | 7.0 | 840.0 | 828 | 12.0 | UA | 407 | N493UA | LGA | IAH | 231.0 | 1416 | 5 | 29 | 2013-01-02T10:00:00Z |
| 848 | 2013 | 1 | 2 | 539.0 | 545 | -6.0 | 959.0 | 1022 | -23.0 | B6 | 725 | N624JB | JFK | BQN | 184.0 | 1576 | 5 | 45 | 2013-01-02T10:00:00Z |
| 849 | 2013 | 1 | 2 | 554.0 | 600 | -6.0 | 845.0 | 901 | -16.0 | B6 | 125 | N637JB | JFK | FLL | 156.0 | 1069 | 6 | 0 | 2013-01-02T11:00:00Z |
| 850 | 2013 | 1 | 2 | 554.0 | 600 | -6.0 | 841.0 | 851 | -10.0 | B6 | 49 | N658JB | JFK | PBI | 146.0 | 1028 | 6 | 0 | 2013-01-02T11:00:00Z |
| 851 | 2013 | 1 | 2 | 554.0 | 600 | -6.0 | 909.0 | 858 | 11.0 | B6 | 371 | N805JB | LGA | FLL | 159.0 | 1076 | 6 | 0 | 2013-01-02T11:00:00Z |
| 852 | 2013 | 1 | 2 | 555.0 | 600 | -5.0 | 931.0 | 910 | 21.0 | AA | 707 | N3BEAA | LGA | DFW | 255.0 | 1389 | 6 | 0 | 2013-01-02T11:00:00Z |
| 853 | 2013 | 1 | 2 | 555.0 | 600 | -5.0 | 856.0 | 856 | 0.0 | B6 | 71 | N806JB | JFK | TPA | 158.0 | 1005 | 6 | 0 | 2013-01-02T11:00:00Z |
| 854 | 2013 | 1 | 2 | 555.0 | 600 | -5.0 | 750.0 | 757 | -7.0 | DL | 731 | N366NB | LGA | DTW | 87.0 | 502 | 6 | 0 | 2013-01-02T11:00:00Z |
| 855 | 2013 | 1 | 2 | 556.0 | 600 | -4.0 | 724.0 | 723 | 1.0 | EV | 5708 | N836AS | LGA | IAD | 54.0 | 229 | 6 | 0 | 2013-01-02T11:00:00Z |
| 856 | 2013 | 1 | 2 | 556.0 | 600 | -4.0 | 837.0 | 837 | 0.0 | DL | 461 | N618DL | LGA | ATL | 128.0 | 762 | 6 | 0 | 2013-01-02T11:00:00Z |
| 858 | 2013 | 1 | 2 | 558.0 | 600 | -2.0 | 838.0 | 815 | 23.0 | FL | 345 | N896AT | LGA | ATL | 129.0 | 762 | 6 | 0 | 2013-01-02T11:00:00Z |
| 859 | 2013 | 1 | 2 | 558.0 | 600 | -2.0 | 916.0 | 931 | -15.0 | UA | 303 | N505UA | JFK | SFO | 341.0 | 2586 | 6 | 0 | 2013-01-02T11:00:00Z |
| 861 | 2013 | 1 | 2 | 559.0 | 600 | -1.0 | 906.0 | 907 | -1.0 | UA | 1077 | N12225 | EWR | MIA | 157.0 | 1085 | 6 | 0 | 2013-01-02T11:00:00Z |
| 862 | 2013 | 1 | 2 | 600.0 | 600 | 0.0 | 814.0 | 749 | 25.0 | EV | 4334 | N13914 | EWR | CMH | 98.0 | 463 | 6 | 0 | 2013-01-02T11:00:00Z |
| 864 | 2013 | 1 | 2 | 600.0 | 600 | 0.0 | 819.0 | 815 | 4.0 | 9E | 4171 | N8946A | EWR | CVG | 120.0 | 569 | 6 | 0 | 2013-01-02T11:00:00Z |
| 865 | 2013 | 1 | 2 | 600.0 | 600 | 0.0 | 846.0 | 846 | 0.0 | B6 | 79 | N529JB | JFK | MCO | 140.0 | 944 | 6 | 0 | 2013-01-02T11:00:00Z |
| 866 | 2013 | 1 | 2 | 600.0 | 600 | 0.0 | 737.0 | 725 | 12.0 | WN | 3136 | N8311Q | LGA | MDW | 117.0 | 725 | 6 | 0 | 2013-01-02T11:00:00Z |
| 868 | 2013 | 1 | 2 | 600.0 | 600 | 0.0 | 747.0 | 735 | 12.0 | UA | 1280 | N62631 | LGA | ORD | 125.0 | 733 | 6 | 0 | 2013-01-02T11:00:00Z |
| 869 | 2013 | 1 | 2 | 602.0 | 600 | 2.0 | 646.0 | 659 | -13.0 | US | 1833 | N951UW | LGA | PHL | 28.0 | 96 | 6 | 0 | 2013-01-02T11:00:00Z |
| 870 | 2013 | 1 | 2 | 603.0 | 600 | 3.0 | 733.0 | 745 | -12.0 | AA | 301 | N3CRAA | LGA | ORD | 118.0 | 733 | 6 | 0 | 2013-01-02T11:00:00Z |
| 871 | 2013 | 1 | 2 | 603.0 | 559 | 4.0 | 912.0 | 916 | -4.0 | UA | 1676 | N17229 | EWR | LAX | 341.0 | 2454 | 5 | 59 | 2013-01-02T10:00:00Z |
| 872 | 2013 | 1 | 2 | 605.0 | 600 | 5.0 | 851.0 | 935 | -44.0 | UA | 421 | N832UA | EWR | SFO | 329.0 | 2565 | 6 | 0 | 2013-01-02T11:00:00Z |
| 876 | 2013 | 1 | 2 | 609.0 | 600 | 9.0 | 909.0 | 854 | 15.0 | B6 | 507 | N630JB | EWR | FLL | 158.0 | 1065 | 6 | 0 | 2013-01-02T11:00:00Z |
| 877 | 2013 | 1 | 2 | 610.0 | 600 | 10.0 | 826.0 | 807 | 19.0 | EV | 5310 | N740EV | LGA | MEM | 172.0 | 963 | 6 | 0 | 2013-01-02T11:00:00Z |
| 879 | 2013 | 1 | 2 | 611.0 | 600 | 11.0 | 756.0 | 725 | 31.0 | WN | 1563 | N235WN | EWR | MDW | 139.0 | 711 | 6 | 0 | 2013-01-02T11:00:00Z |
| 880 | 2013 | 1 | 2 | 612.0 | 600 | 12.0 | 901.0 | 850 | 11.0 | B6 | 343 | N579JB | EWR | PBI | 146.0 | 1023 | 6 | 0 | 2013-01-02T11:00:00Z |
| 882 | 2013 | 1 | 2 | 616.0 | 600 | 16.0 | 1001.0 | 917 | 44.0 | UA | 1141 | N19141 | JFK | LAX | 354.0 | 2475 | 6 | 0 | 2013-01-02T11:00:00Z |
| 886 | 2013 | 1 | 2 | 624.0 | 600 | 24.0 | 908.0 | 825 | 43.0 | MQ | 4650 | N513MQ | LGA | ATL | 138.0 | 762 | 6 | 0 | 2013-01-02T11:00:00Z |
| 948 | 2013 | 1 | 2 | 720.0 | 600 | 80.0 | 905.0 | 735 | 90.0 | MQ | 3768 | N520MQ | EWR | ORD | 112.0 | 719 | 6 | 0 | 2013-01-02T11:00:00Z |
| 1032 | 2013 | 1 | 2 | 833.0 | 558 | 155.0 | 1018.0 | 727 | 171.0 | UA | 651 | N448UA | EWR | ORD | 129.0 | 719 | 5 | 58 | 2013-01-02T10:00:00Z |
As shown above, multiple filtering conditions are combined with “&”: every condition must be true in order for a row to be included in the output.
For other types of combinations, you’ll need to use Boolean operators yourself: & is “and”, | is “or”, and ~ is “not”.
# Select flights in either Janurary or Feburary
flights[(flights.month==1) | (flights.month==2)]
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 136242 | 2013 | 2 | 28 | NaN | 850 | NaN | NaN | 1035 | NaN | MQ | 4558 | N737MQ | LGA | CLE | NaN | 419 | 8 | 50 | 2013-02-28T13:00:00Z |
| 136243 | 2013 | 2 | 28 | NaN | 905 | NaN | NaN | 1115 | NaN | MQ | 4478 | N722MQ | LGA | DTW | NaN | 502 | 9 | 5 | 2013-02-28T14:00:00Z |
| 136244 | 2013 | 2 | 28 | NaN | 1115 | NaN | NaN | 1310 | NaN | MQ | 4485 | N725MQ | LGA | CMH | NaN | 479 | 11 | 15 | 2013-02-28T16:00:00Z |
| 136245 | 2013 | 2 | 28 | NaN | 830 | NaN | NaN | 1205 | NaN | UA | 1480 | NaN | EWR | SFO | NaN | 2565 | 8 | 30 | 2013-02-28T13:00:00Z |
| 136246 | 2013 | 2 | 28 | NaN | 840 | NaN | NaN | 1147 | NaN | UA | 443 | NaN | JFK | LAX | NaN | 2475 | 8 | 40 | 2013-02-28T13:00:00Z |
51955 rows × 19 columns
# Select flights in the second quarter
flights[flights.month.isin([4,5,6])]
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 165081 | 2013 | 4 | 1 | 454.0 | 500 | -6.0 | 636.0 | 640 | -4.0 | US | 1843 | N566UW | EWR | CLT | 84.0 | 529 | 5 | 0 | 2013-04-01T09:00:00Z |
| 165082 | 2013 | 4 | 1 | 509.0 | 515 | -6.0 | 743.0 | 814 | -31.0 | UA | 1545 | N76288 | EWR | IAH | 194.0 | 1400 | 5 | 15 | 2013-04-01T09:00:00Z |
| 165083 | 2013 | 4 | 1 | 526.0 | 530 | -4.0 | 812.0 | 827 | -15.0 | UA | 1714 | N76517 | LGA | IAH | 206.0 | 1416 | 5 | 30 | 2013-04-01T09:00:00Z |
| 165084 | 2013 | 4 | 1 | 534.0 | 540 | -6.0 | 833.0 | 850 | -17.0 | AA | 1141 | N5DSAA | JFK | MIA | 152.0 | 1089 | 5 | 40 | 2013-04-01T09:00:00Z |
| 165085 | 2013 | 4 | 1 | 542.0 | 545 | -3.0 | 914.0 | 920 | -6.0 | B6 | 725 | N784JB | JFK | BQN | 191.0 | 1576 | 5 | 45 | 2013-04-01T09:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 250445 | 2013 | 6 | 30 | NaN | 1945 | NaN | NaN | 2104 | NaN | EV | 5714 | N836AS | JFK | IAD | NaN | 228 | 19 | 45 | 2013-06-30T23:00:00Z |
| 250446 | 2013 | 6 | 30 | NaN | 1610 | NaN | NaN | 1805 | NaN | EV | 4092 | N16147 | EWR | DAY | NaN | 533 | 16 | 10 | 2013-06-30T20:00:00Z |
| 250447 | 2013 | 6 | 30 | NaN | 1709 | NaN | NaN | 1856 | NaN | EV | 4662 | N16911 | EWR | RDU | NaN | 416 | 17 | 9 | 2013-06-30T21:00:00Z |
| 250448 | 2013 | 6 | 30 | NaN | 2059 | NaN | NaN | 2307 | NaN | EV | 5254 | N760EV | LGA | DSM | NaN | 1031 | 20 | 59 | 2013-07-01T00:00:00Z |
| 250449 | 2013 | 6 | 30 | NaN | 1915 | NaN | NaN | 2131 | NaN | EV | 5268 | N744EV | LGA | CLT | NaN | 544 | 19 | 15 | 2013-06-30T23:00:00Z |
85369 rows × 19 columns
# Select flights that are not in January
flights[flights.month!=1]
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27004 | 2013 | 10 | 1 | 447.0 | 500 | -13.0 | 614.0 | 648 | -34.0 | US | 1877 | N538UW | EWR | CLT | 69.0 | 529 | 5 | 0 | 2013-10-01T09:00:00Z |
| 27005 | 2013 | 10 | 1 | 522.0 | 517 | 5.0 | 735.0 | 757 | -22.0 | UA | 252 | N556UA | EWR | IAH | 174.0 | 1400 | 5 | 17 | 2013-10-01T09:00:00Z |
| 27006 | 2013 | 10 | 1 | 536.0 | 545 | -9.0 | 809.0 | 855 | -46.0 | AA | 2243 | N630AA | JFK | MIA | 132.0 | 1089 | 5 | 45 | 2013-10-01T09:00:00Z |
| 27007 | 2013 | 10 | 1 | 539.0 | 545 | -6.0 | 801.0 | 827 | -26.0 | UA | 1714 | N37252 | LGA | IAH | 172.0 | 1416 | 5 | 45 | 2013-10-01T09:00:00Z |
| 27008 | 2013 | 10 | 1 | 539.0 | 545 | -6.0 | 917.0 | 933 | -16.0 | B6 | 1403 | N789JB | JFK | SJU | 186.0 | 1598 | 5 | 45 | 2013-10-01T09:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | JFK | DCA | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z |
| 336772 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | NaN | LGA | SYR | NaN | 198 | 22 | 0 | 2013-10-01T02:00:00Z |
| 336773 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | LGA | BNA | NaN | 764 | 12 | 10 | 2013-09-30T16:00:00Z |
| 336774 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | LGA | CLE | NaN | 419 | 11 | 59 | 2013-09-30T15:00:00Z |
| 336775 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | LGA | RDU | NaN | 431 | 8 | 40 | 2013-09-30T12:00:00Z |
309772 rows × 19 columns
# Select flights that are not in January, Feburary, or March
flights[(flights.month!=1) & (flights.month!=2) & (flights.month!=3)]
#flights[~flights.month.isin([1,2,3])]
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27004 | 2013 | 10 | 1 | 447.0 | 500 | -13.0 | 614.0 | 648 | -34.0 | US | 1877 | N538UW | EWR | CLT | 69.0 | 529 | 5 | 0 | 2013-10-01T09:00:00Z |
| 27005 | 2013 | 10 | 1 | 522.0 | 517 | 5.0 | 735.0 | 757 | -22.0 | UA | 252 | N556UA | EWR | IAH | 174.0 | 1400 | 5 | 17 | 2013-10-01T09:00:00Z |
| 27006 | 2013 | 10 | 1 | 536.0 | 545 | -9.0 | 809.0 | 855 | -46.0 | AA | 2243 | N630AA | JFK | MIA | 132.0 | 1089 | 5 | 45 | 2013-10-01T09:00:00Z |
| 27007 | 2013 | 10 | 1 | 539.0 | 545 | -6.0 | 801.0 | 827 | -26.0 | UA | 1714 | N37252 | LGA | IAH | 172.0 | 1416 | 5 | 45 | 2013-10-01T09:00:00Z |
| 27008 | 2013 | 10 | 1 | 539.0 | 545 | -6.0 | 917.0 | 933 | -16.0 | B6 | 1403 | N789JB | JFK | SJU | 186.0 | 1598 | 5 | 45 | 2013-10-01T09:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | JFK | DCA | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z |
| 336772 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | NaN | LGA | SYR | NaN | 198 | 22 | 0 | 2013-10-01T02:00:00Z |
| 336773 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | LGA | BNA | NaN | 764 | 12 | 10 | 2013-09-30T16:00:00Z |
| 336774 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | LGA | CLE | NaN | 419 | 11 | 59 | 2013-09-30T15:00:00Z |
| 336775 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | LGA | RDU | NaN | 431 | 8 | 40 | 2013-09-30T12:00:00Z |
255987 rows × 19 columns
# Use query() function
flights.query('month>=1 and month<=3')
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 165076 | 2013 | 3 | 31 | 2349.0 | 2355 | -6.0 | 333.0 | 338 | -5.0 | B6 | 707 | N657JB | JFK | SJU | 202.0 | 1598 | 23 | 55 | 2013-04-01T03:00:00Z |
| 165077 | 2013 | 3 | 31 | 2358.0 | 2359 | -1.0 | 332.0 | 339 | -7.0 | B6 | 727 | N608JB | JFK | BQN | 195.0 | 1576 | 23 | 59 | 2013-04-01T03:00:00Z |
| 165078 | 2013 | 3 | 31 | NaN | 1627 | NaN | NaN | 1734 | NaN | EV | 4299 | N17560 | EWR | DCA | NaN | 199 | 16 | 27 | 2013-03-31T20:00:00Z |
| 165079 | 2013 | 3 | 31 | NaN | 600 | NaN | NaN | 725 | NaN | EV | 5689 | N829AS | LGA | IAD | NaN | 229 | 6 | 0 | 2013-03-31T10:00:00Z |
| 165080 | 2013 | 3 | 31 | NaN | 929 | NaN | NaN | 1220 | NaN | UA | 1597 | NaN | EWR | EGE | NaN | 1725 | 9 | 29 | 2013-03-31T13:00:00Z |
80789 rows × 19 columns
It is quite common to have missing values or NaN's in data frames. NaN represents an unknown value so missing values are “contagious”: almost any operation involving an unknown value will also be unknown.
In Python, if you want to determine if a value is missing, use .isnull():
flights[flights.arr_time.isnull()]
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 754 | 2013 | 1 | 1 | 2016.0 | 1930 | 46.0 | NaN | 2220 | NaN | EV | 4204 | N14168 | EWR | OKC | NaN | 1325 | 19 | 30 | 2013-01-02T00:00:00Z |
| 838 | 2013 | 1 | 1 | NaN | 1630 | NaN | NaN | 1815 | NaN | EV | 4308 | N18120 | EWR | RDU | NaN | 416 | 16 | 30 | 2013-01-01T21:00:00Z |
| 839 | 2013 | 1 | 1 | NaN | 1935 | NaN | NaN | 2240 | NaN | AA | 791 | N3EHAA | LGA | DFW | NaN | 1389 | 19 | 35 | 2013-01-02T00:00:00Z |
| 840 | 2013 | 1 | 1 | NaN | 1500 | NaN | NaN | 1825 | NaN | AA | 1925 | N3EVAA | LGA | MIA | NaN | 1096 | 15 | 0 | 2013-01-01T20:00:00Z |
| 841 | 2013 | 1 | 1 | NaN | 600 | NaN | NaN | 901 | NaN | B6 | 125 | N618JB | JFK | FLL | NaN | 1069 | 6 | 0 | 2013-01-01T11:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | JFK | DCA | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z |
| 336772 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | NaN | LGA | SYR | NaN | 198 | 22 | 0 | 2013-10-01T02:00:00Z |
| 336773 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | LGA | BNA | NaN | 764 | 12 | 10 | 2013-09-30T16:00:00Z |
| 336774 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | LGA | CLE | NaN | 419 | 11 | 59 | 2013-09-30T15:00:00Z |
| 336775 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | LGA | RDU | NaN | 431 | 8 | 40 | 2013-09-30T12:00:00Z |
8713 rows × 19 columns
flights.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 336776 entries, 0 to 336775 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 year 336776 non-null int64 1 month 336776 non-null int64 2 day 336776 non-null int64 3 dep_time 328521 non-null float64 4 sched_dep_time 336776 non-null int64 5 dep_delay 328521 non-null float64 6 arr_time 328063 non-null float64 7 sched_arr_time 336776 non-null int64 8 arr_delay 327346 non-null float64 9 carrier 336776 non-null object 10 flight 336776 non-null int64 11 tailnum 334264 non-null object 12 origin 336776 non-null object 13 dest 336776 non-null object 14 air_time 327346 non-null float64 15 distance 336776 non-null int64 16 hour 336776 non-null int64 17 minute 336776 non-null int64 18 time_hour 336776 non-null object dtypes: float64(5), int64(9), object(5) memory usage: 48.8+ MB
Given a data frame, we often want to sort the rows by a column name, or a set of column names, or more complicated expressions. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
# Order rows by month
flights.sort_values('month')
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 18009 | 2013 | 1 | 21 | 1754.0 | 1800 | -6.0 | 1903.0 | 1915 | -12.0 | B6 | 1016 | N184JB | JFK | BOS | 44.0 | 187 | 18 | 0 | 2013-01-21T23:00:00Z |
| 18008 | 2013 | 1 | 21 | 1753.0 | 1800 | -7.0 | 1859.0 | 1913 | -14.0 | US | 2185 | N737US | LGA | DCA | 54.0 | 214 | 18 | 0 | 2013-01-21T23:00:00Z |
| 18007 | 2013 | 1 | 21 | 1752.0 | 1800 | -8.0 | 1850.0 | 1913 | -23.0 | US | 2138 | N952UW | LGA | BOS | 42.0 | 184 | 18 | 0 | 2013-01-21T23:00:00Z |
| 18006 | 2013 | 1 | 21 | 1751.0 | 1753 | -2.0 | 2052.0 | 2105 | -13.0 | UA | 535 | N554UA | JFK | LAX | 336.0 | 2475 | 17 | 53 | 2013-01-21T22:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 92533 | 2013 | 12 | 11 | 622.0 | 630 | -8.0 | 814.0 | 815 | -1.0 | AA | 303 | N3DEAA | LGA | ORD | 136.0 | 733 | 6 | 30 | 2013-12-11T11:00:00Z |
| 92532 | 2013 | 12 | 11 | 621.0 | 625 | -4.0 | 805.0 | 750 | 15.0 | WN | 1360 | N8321D | LGA | MDW | 134.0 | 725 | 6 | 25 | 2013-12-11T11:00:00Z |
| 92531 | 2013 | 12 | 11 | 620.0 | 630 | -10.0 | 940.0 | 938 | 2.0 | B6 | 929 | N595JB | JFK | RSW | 179.0 | 1074 | 6 | 30 | 2013-12-11T11:00:00Z |
| 92542 | 2013 | 12 | 11 | 631.0 | 635 | -4.0 | 948.0 | 943 | 5.0 | UA | 1299 | N17229 | EWR | RSW | 179.0 | 1068 | 6 | 35 | 2013-12-11T11:00:00Z |
| 109119 | 2013 | 12 | 29 | 1455.0 | 1500 | -5.0 | 1658.0 | 1656 | 2.0 | US | 721 | N174US | EWR | CLT | 96.0 | 529 | 15 | 0 | 2013-12-29T20:00:00Z |
336776 rows × 19 columns
# Order rows by month in descending order
flights.sort_values('month', ascending=False)
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 84862 | 2013 | 12 | 2 | 1713.0 | 1715 | -2.0 | 1856.0 | 1915 | -19.0 | AA | 199 | N3FWAA | JFK | ORD | 120.0 | 740 | 17 | 15 | 2013-12-02T22:00:00Z |
| 93115 | 2013 | 12 | 11 | 1629.0 | 1459 | 90.0 | 1731.0 | 1625 | 66.0 | 9E | 2903 | N297PQ | JFK | BOS | 39.0 | 187 | 14 | 59 | 2013-12-11T19:00:00Z |
| 93104 | 2013 | 12 | 11 | 1622.0 | 1620 | 2.0 | 1848.0 | 1829 | 19.0 | EV | 4352 | N14953 | EWR | CVG | 122.0 | 569 | 16 | 20 | 2013-12-11T21:00:00Z |
| 93105 | 2013 | 12 | 11 | 1623.0 | 1630 | -7.0 | 1842.0 | 1845 | -3.0 | DL | 2231 | N944DL | LGA | DTW | 102.0 | 502 | 16 | 30 | 2013-12-11T21:00:00Z |
| 93106 | 2013 | 12 | 11 | 1623.0 | 1630 | -7.0 | 1756.0 | 1805 | -9.0 | EV | 5293 | N712EV | LGA | ORF | 63.0 | 296 | 16 | 30 | 2013-12-11T21:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18007 | 2013 | 1 | 21 | 1752.0 | 1800 | -8.0 | 1850.0 | 1913 | -23.0 | US | 2138 | N952UW | LGA | BOS | 42.0 | 184 | 18 | 0 | 2013-01-21T23:00:00Z |
| 18008 | 2013 | 1 | 21 | 1753.0 | 1800 | -7.0 | 1859.0 | 1913 | -14.0 | US | 2185 | N737US | LGA | DCA | 54.0 | 214 | 18 | 0 | 2013-01-21T23:00:00Z |
| 18009 | 2013 | 1 | 21 | 1754.0 | 1800 | -6.0 | 1903.0 | 1915 | -12.0 | B6 | 1016 | N184JB | JFK | BOS | 44.0 | 187 | 18 | 0 | 2013-01-21T23:00:00Z |
| 18010 | 2013 | 1 | 21 | 1755.0 | 1800 | -5.0 | 2015.0 | 2006 | 9.0 | US | 373 | N657AW | JFK | CLT | 99.0 | 541 | 18 | 0 | 2013-01-21T23:00:00Z |
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
336776 rows × 19 columns
# Order rows by year, month, day
flights.sort_values(by=['year','month','day'])
# Or simply:
#flights.sort_values(['year','month','day'])
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 111291 | 2013 | 12 | 31 | NaN | 705 | NaN | NaN | 931 | NaN | UA | 1729 | NaN | EWR | DEN | NaN | 1605 | 7 | 5 | 2013-12-31T12:00:00Z |
| 111292 | 2013 | 12 | 31 | NaN | 825 | NaN | NaN | 1029 | NaN | US | 1831 | NaN | JFK | CLT | NaN | 541 | 8 | 25 | 2013-12-31T13:00:00Z |
| 111293 | 2013 | 12 | 31 | NaN | 1615 | NaN | NaN | 1800 | NaN | MQ | 3301 | N844MQ | LGA | RDU | NaN | 431 | 16 | 15 | 2013-12-31T21:00:00Z |
| 111294 | 2013 | 12 | 31 | NaN | 600 | NaN | NaN | 735 | NaN | UA | 219 | NaN | EWR | ORD | NaN | 719 | 6 | 0 | 2013-12-31T11:00:00Z |
| 111295 | 2013 | 12 | 31 | NaN | 830 | NaN | NaN | 1154 | NaN | UA | 443 | NaN | JFK | LAX | NaN | 2475 | 8 | 30 | 2013-12-31T13:00:00Z |
336776 rows × 19 columns
# You can specify different ascending arguments for different column names
flights.sort_values(['month', 'day'], ascending=[True, False])
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26076 | 2013 | 1 | 31 | 1.0 | 2100 | 181.0 | 124.0 | 2225 | 179.0 | WN | 530 | N550WN | LGA | MDW | 127.0 | 725 | 21 | 0 | 2013-02-01T02:00:00Z |
| 26077 | 2013 | 1 | 31 | 4.0 | 2359 | 5.0 | 455.0 | 444 | 11.0 | B6 | 739 | N599JB | JFK | PSE | 206.0 | 1617 | 23 | 59 | 2013-02-01T04:00:00Z |
| 26078 | 2013 | 1 | 31 | 7.0 | 2359 | 8.0 | 453.0 | 437 | 16.0 | B6 | 727 | N505JB | JFK | BQN | 197.0 | 1576 | 23 | 59 | 2013-02-01T04:00:00Z |
| 26079 | 2013 | 1 | 31 | 12.0 | 2250 | 82.0 | 132.0 | 7 | 85.0 | B6 | 30 | N178JB | JFK | ROC | 60.0 | 264 | 22 | 50 | 2013-02-01T03:00:00Z |
| 26080 | 2013 | 1 | 31 | 26.0 | 2154 | 152.0 | 328.0 | 50 | 158.0 | B6 | 515 | N663JB | EWR | FLL | 161.0 | 1065 | 21 | 54 | 2013-02-01T02:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 84143 | 2013 | 12 | 1 | NaN | 830 | NaN | NaN | 1039 | NaN | 9E | 3385 | NaN | EWR | MSP | NaN | 1008 | 8 | 30 | 2013-12-01T13:00:00Z |
| 84144 | 2013 | 12 | 1 | NaN | 2229 | NaN | NaN | 2343 | NaN | B6 | 234 | N192JB | JFK | BTV | NaN | 266 | 22 | 29 | 2013-12-02T03:00:00Z |
| 84145 | 2013 | 12 | 1 | NaN | 631 | NaN | NaN | 742 | NaN | EV | 4194 | N13975 | EWR | DCA | NaN | 199 | 6 | 31 | 2013-12-01T11:00:00Z |
| 84146 | 2013 | 12 | 1 | NaN | 620 | NaN | NaN | 826 | NaN | EV | 5178 | N614QX | EWR | MSP | NaN | 1008 | 6 | 20 | 2013-12-01T11:00:00Z |
| 84147 | 2013 | 12 | 1 | NaN | 700 | NaN | NaN | 834 | NaN | UA | 643 | NaN | EWR | ORD | NaN | 719 | 7 | 0 | 2013-12-01T12:00:00Z |
336776 rows × 19 columns
# By default, missing values (NAs) are always sorted at the end.
flights.sort_values('dep_delay')
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 89673 | 2013 | 12 | 7 | 2040.0 | 2123 | -43.0 | 40.0 | 2352 | 48.0 | B6 | 97 | N592JB | JFK | DEN | 265.0 | 1626 | 21 | 23 | 2013-12-08T02:00:00Z |
| 113633 | 2013 | 2 | 3 | 2022.0 | 2055 | -33.0 | 2240.0 | 2338 | -58.0 | DL | 1715 | N612DL | LGA | MSY | 162.0 | 1183 | 20 | 55 | 2013-02-04T01:00:00Z |
| 64501 | 2013 | 11 | 10 | 1408.0 | 1440 | -32.0 | 1549.0 | 1559 | -10.0 | EV | 5713 | N825AS | LGA | IAD | 52.0 | 229 | 14 | 40 | 2013-11-10T19:00:00Z |
| 9619 | 2013 | 1 | 11 | 1900.0 | 1930 | -30.0 | 2233.0 | 2243 | -10.0 | DL | 1435 | N934DL | LGA | TPA | 139.0 | 1010 | 19 | 30 | 2013-01-12T00:00:00Z |
| 24915 | 2013 | 1 | 29 | 1703.0 | 1730 | -27.0 | 1947.0 | 1957 | -10.0 | F9 | 837 | N208FR | LGA | DEN | 250.0 | 1620 | 17 | 30 | 2013-01-29T22:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | JFK | DCA | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z |
| 336772 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | NaN | LGA | SYR | NaN | 198 | 22 | 0 | 2013-10-01T02:00:00Z |
| 336773 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | LGA | BNA | NaN | 764 | 12 | 10 | 2013-09-30T16:00:00Z |
| 336774 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | LGA | CLE | NaN | 419 | 11 | 59 | 2013-09-30T15:00:00Z |
| 336775 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | LGA | RDU | NaN | 431 | 8 | 40 | 2013-09-30T12:00:00Z |
336776 rows × 19 columns
When you work with a dataset with hundreds or even thousands of variables, which is not uncommon, the first challenge is often narrowing in on the variables you’re actually interested in.
# Select one column
#flights['carrier']
flights.carrier
0 UA
1 UA
2 AA
3 B6
4 DL
..
336771 9E
336772 9E
336773 MQ
336774 MQ
336775 MQ
Name: carrier, Length: 336776, dtype: object
# Select multiple columns
flights[['year','month','day']]
| year | month | day | |
|---|---|---|---|
| 0 | 2013 | 1 | 1 |
| 1 | 2013 | 1 | 1 |
| 2 | 2013 | 1 | 1 |
| 3 | 2013 | 1 | 1 |
| 4 | 2013 | 1 | 1 |
| ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 |
| 336772 | 2013 | 9 | 30 |
| 336773 | 2013 | 9 | 30 |
| 336774 | 2013 | 9 | 30 |
| 336775 | 2013 | 9 | 30 |
336776 rows × 3 columns
Select columns whose name matches regular expression regex.
df.filter(regex='regex')
# Select all columns containing a '_' in the name.
flights.filter(regex='_')
| dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | air_time | time_hour | |
|---|---|---|---|---|---|---|---|---|
| 0 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | 227.0 | 2013-01-01T10:00:00Z |
| 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | 227.0 | 2013-01-01T10:00:00Z |
| 2 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | 160.0 | 2013-01-01T10:00:00Z |
| 3 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | 183.0 | 2013-01-01T10:00:00Z |
| 4 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | 116.0 | 2013-01-01T11:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | NaN | 1455 | NaN | NaN | 1634 | NaN | NaN | 2013-09-30T18:00:00Z |
| 336772 | NaN | 2200 | NaN | NaN | 2312 | NaN | NaN | 2013-10-01T02:00:00Z |
| 336773 | NaN | 1210 | NaN | NaN | 1330 | NaN | NaN | 2013-09-30T16:00:00Z |
| 336774 | NaN | 1159 | NaN | NaN | 1344 | NaN | NaN | 2013-09-30T15:00:00Z |
| 336775 | NaN | 840 | NaN | NaN | 1020 | NaN | NaN | 2013-09-30T12:00:00Z |
336776 rows × 8 columns
# Select all columns beginning with word 'dep'
flights.filter(regex='^dep')
| dep_time | dep_delay | |
|---|---|---|
| 0 | 517.0 | 2.0 |
| 1 | 533.0 | 4.0 |
| 2 | 542.0 | 2.0 |
| 3 | 544.0 | -1.0 |
| 4 | 554.0 | -6.0 |
| ... | ... | ... |
| 336771 | NaN | NaN |
| 336772 | NaN | NaN |
| 336773 | NaN | NaN |
| 336774 | NaN | NaN |
| 336775 | NaN | NaN |
336776 rows × 2 columns
# Select all columns endding with word 'time'
flights.filter(regex='time$')
| dep_time | sched_dep_time | arr_time | sched_arr_time | air_time | |
|---|---|---|---|---|---|
| 0 | 517.0 | 515 | 830.0 | 819 | 227.0 |
| 1 | 533.0 | 529 | 850.0 | 830 | 227.0 |
| 2 | 542.0 | 540 | 923.0 | 850 | 160.0 |
| 3 | 544.0 | 545 | 1004.0 | 1022 | 183.0 |
| 4 | 554.0 | 600 | 812.0 | 837 | 116.0 |
| ... | ... | ... | ... | ... | ... |
| 336771 | NaN | 1455 | NaN | 1634 | NaN |
| 336772 | NaN | 2200 | NaN | 2312 | NaN |
| 336773 | NaN | 1210 | NaN | 1330 | NaN |
| 336774 | NaN | 1159 | NaN | 1344 | NaN |
| 336775 | NaN | 840 | NaN | 1020 | NaN |
336776 rows × 5 columns
# Select all columns beginning with 'a', endding with 'e', and any string in between.
flights.filter(regex='^a.*e$')
| arr_time | air_time | |
|---|---|---|
| 0 | 830.0 | 227.0 |
| 1 | 850.0 | 227.0 |
| 2 | 923.0 | 160.0 |
| 3 | 1004.0 | 183.0 |
| 4 | 812.0 | 116.0 |
| ... | ... | ... |
| 336771 | NaN | NaN |
| 336772 | NaN | NaN |
| 336773 | NaN | NaN |
| 336774 | NaN | NaN |
| 336775 | NaN | NaN |
336776 rows × 2 columns
# Select all columns between 'carrier' and 'dest' (inclusive).
flights.loc[:,'carrier':'dest']
| carrier | flight | tailnum | origin | dest | |
|---|---|---|---|---|---|
| 0 | UA | 1545 | N14228 | EWR | IAH |
| 1 | UA | 1714 | N24211 | LGA | IAH |
| 2 | AA | 1141 | N619AA | JFK | MIA |
| 3 | B6 | 725 | N804JB | JFK | BQN |
| 4 | DL | 461 | N668DN | LGA | ATL |
| ... | ... | ... | ... | ... | ... |
| 336771 | 9E | 3393 | NaN | JFK | DCA |
| 336772 | 9E | 3525 | NaN | LGA | SYR |
| 336773 | MQ | 3461 | N535MQ | LGA | BNA |
| 336774 | MQ | 3572 | N511MQ | LGA | CLE |
| 336775 | MQ | 3531 | N839MQ | LGA | RDU |
336776 rows × 5 columns
# Select by column indexes:
# Select columns in positions 1, 2 and 5 (first column is 0).
flights.iloc[:,[1,2,5]]
| month | day | dep_delay | |
|---|---|---|---|
| 0 | 1 | 1 | 2.0 |
| 1 | 1 | 1 | 4.0 |
| 2 | 1 | 1 | 2.0 |
| 3 | 1 | 1 | -1.0 |
| 4 | 1 | 1 | -6.0 |
| ... | ... | ... | ... |
| 336771 | 9 | 30 | NaN |
| 336772 | 9 | 30 | NaN |
| 336773 | 9 | 30 | NaN |
| 336774 | 9 | 30 | NaN |
| 336775 | 9 | 30 | NaN |
336776 rows × 3 columns
# Select rows meeting logical condition, and only the specific columns.
# Select all flights in January, display the day, carrier, and flight:
flights.loc[flights['month']==1, ['day','carrier', 'flight']]
| day | carrier | flight | |
|---|---|---|---|
| 0 | 1 | UA | 1545 |
| 1 | 1 | UA | 1714 |
| 2 | 1 | AA | 1141 |
| 3 | 1 | B6 | 725 |
| 4 | 1 | DL | 461 |
| ... | ... | ... | ... |
| 26999 | 31 | MQ | 4475 |
| 27000 | 31 | MQ | 4658 |
| 27001 | 31 | MQ | 4491 |
| 27002 | 31 | UA | 337 |
| 27003 | 31 | UA | 1497 |
27004 rows × 3 columns
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns.
# First, let's create a small dataframe to work with
flights_sml = flights.filter(['year','month','day','dep_delay','arr_delay','distance','air_time'])
flights_sml
| year | month | day | dep_delay | arr_delay | distance | air_time | |
|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 2.0 | 11.0 | 1400 | 227.0 |
| 1 | 2013 | 1 | 1 | 4.0 | 20.0 | 1416 | 227.0 |
| 2 | 2013 | 1 | 1 | 2.0 | 33.0 | 1089 | 160.0 |
| 3 | 2013 | 1 | 1 | -1.0 | -18.0 | 1576 | 183.0 |
| 4 | 2013 | 1 | 1 | -6.0 | -25.0 | 762 | 116.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | NaN | NaN | 213 | NaN |
| 336772 | 2013 | 9 | 30 | NaN | NaN | 198 | NaN |
| 336773 | 2013 | 9 | 30 | NaN | NaN | 764 | NaN |
| 336774 | 2013 | 9 | 30 | NaN | NaN | 419 | NaN |
| 336775 | 2013 | 9 | 30 | NaN | NaN | 431 | NaN |
336776 rows × 7 columns
# Create two new variables one at a time
flights_sml['gain'] = flights_sml.dep_delay - flights_sml.arr_delay
flights_sml['speed'] = flights_sml.distance / flights_sml.air_time * 60
flights_sml.head()
| year | month | day | dep_delay | arr_delay | distance | air_time | gain | speed | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 2.0 | 11.0 | 1400 | 227.0 | -9.0 | 370.044053 |
| 1 | 2013 | 1 | 1 | 4.0 | 20.0 | 1416 | 227.0 | -16.0 | 374.273128 |
| 2 | 2013 | 1 | 1 | 2.0 | 33.0 | 1089 | 160.0 | -31.0 | 408.375000 |
| 3 | 2013 | 1 | 1 | -1.0 | -18.0 | 1576 | 183.0 | 17.0 | 516.721311 |
| 4 | 2013 | 1 | 1 | -6.0 | -25.0 | 762 | 116.0 | 19.0 | 394.137931 |
# Remove existing columns from a dataframe
flights_sml.drop(columns=['gain','speed'])
| year | month | day | dep_delay | arr_delay | distance | air_time | |
|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 2.0 | 11.0 | 1400 | 227.0 |
| 1 | 2013 | 1 | 1 | 4.0 | 20.0 | 1416 | 227.0 |
| 2 | 2013 | 1 | 1 | 2.0 | 33.0 | 1089 | 160.0 |
| 3 | 2013 | 1 | 1 | -1.0 | -18.0 | 1576 | 183.0 |
| 4 | 2013 | 1 | 1 | -6.0 | -25.0 | 762 | 116.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | NaN | NaN | 213 | NaN |
| 336772 | 2013 | 9 | 30 | NaN | NaN | 198 | NaN |
| 336773 | 2013 | 9 | 30 | NaN | NaN | 764 | NaN |
| 336774 | 2013 | 9 | 30 | NaN | NaN | 419 | NaN |
| 336775 | 2013 | 9 | 30 | NaN | NaN | 431 | NaN |
336776 rows × 7 columns
# Create multiple new columns
flights_sml.assign(
gain = lambda x: x.dep_delay - x.arr_delay,
hours = lambda x: x.air_time / 60,
gain_per_hour = lambda x: x.gain / x.hours # Note that you can refer to columns that you’ve just created
)
| year | month | day | dep_delay | arr_delay | distance | air_time | gain | speed | hours | gain_per_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 2.0 | 11.0 | 1400 | 227.0 | -9.0 | 370.044053 | 3.783333 | -2.378855 |
| 1 | 2013 | 1 | 1 | 4.0 | 20.0 | 1416 | 227.0 | -16.0 | 374.273128 | 3.783333 | -4.229075 |
| 2 | 2013 | 1 | 1 | 2.0 | 33.0 | 1089 | 160.0 | -31.0 | 408.375000 | 2.666667 | -11.625000 |
| 3 | 2013 | 1 | 1 | -1.0 | -18.0 | 1576 | 183.0 | 17.0 | 516.721311 | 3.050000 | 5.573770 |
| 4 | 2013 | 1 | 1 | -6.0 | -25.0 | 762 | 116.0 | 19.0 | 394.137931 | 1.933333 | 9.827586 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | NaN | NaN | 213 | NaN | NaN | NaN | NaN | NaN |
| 336772 | 2013 | 9 | 30 | NaN | NaN | 198 | NaN | NaN | NaN | NaN | NaN |
| 336773 | 2013 | 9 | 30 | NaN | NaN | 764 | NaN | NaN | NaN | NaN | NaN |
| 336774 | 2013 | 9 | 30 | NaN | NaN | 419 | NaN | NaN | NaN | NaN | NaN |
| 336775 | 2013 | 9 | 30 | NaN | NaN | 431 | NaN | NaN | NaN | NaN | NaN |
336776 rows × 11 columns
There are many functions for creating new variables
flights_sml['air_time_hours'] = flights_sml.air_time // 60
flights_sml['log2_dist'] = np.log2(flights_sml.distance)
flights_sml['gain_pos'] = flights_sml.gain > 0
flights_sml['gain_cumsum'] = flights_sml.gain.cumsum()
flights_sml['dist_rank'] = flights_sml['distance'].rank(method='min',ascending=True)
flights_sml.head()
| year | month | day | dep_delay | arr_delay | distance | air_time | gain | speed | air_time_hours | log2_dist | gain_pos | gain_cumsum | dist_rank | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 2.0 | 11.0 | 1400 | 227.0 | -9.0 | 370.044053 | 3.0 | 10.451211 | False | -9.0 | 254751.0 |
| 1 | 2013 | 1 | 1 | 4.0 | 20.0 | 1416 | 227.0 | -16.0 | 374.273128 | 3.0 | 10.467606 | False | -25.0 | 259700.0 |
| 2 | 2013 | 1 | 1 | 2.0 | 33.0 | 1089 | 160.0 | -31.0 | 408.375000 | 2.0 | 10.088788 | False | -56.0 | 228548.0 |
| 3 | 2013 | 1 | 1 | -1.0 | -18.0 | 1576 | 183.0 | 17.0 | 516.721311 | 3.0 | 10.622052 | True | -39.0 | 266833.0 |
| 4 | 2013 | 1 | 1 | -6.0 | -25.0 | 762 | 116.0 | 19.0 | 394.137931 | 1.0 | 9.573647 | True | -20.0 | 149279.0 |
In this assignment, you will be working on the same dataframe of flights departing New York City in 2013.
import pandas as pd
# Install the package 'nycflights13' before you can run this
from nycflights13 import flights
flights.head()
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
flights.shape
(336776, 19)
Date of departure
Actual departure and arrival times (format HHMM or HMM), local tz.
Scheduled departure and arrival times (format HHMM or HMM), local tz.
Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
Time of scheduled departure broken into hour and minutes.
Two letter carrier abbreviation. See airlines() to get name
Plane tail number
Flight number
Origin and destination. See airports() for additional metadata.
Amount of time spent in the air, in minutes
Distance between airports, in miles
Scheduled date and hour of the flight as a date. Along with origin, can be used to join flights data to weather data.
# use describe() to summarize all columns
flights.describe(include='all')
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 336776.0 | 336776.000000 | 336776.000000 | 328521.000000 | 336776.000000 | 328521.000000 | 328063.000000 | 336776.000000 | 327346.000000 | 336776 | 336776.000000 | 334264 | 336776 | 336776 | 327346.000000 | 336776.000000 | 336776.000000 | 336776.000000 | 336776 |
| unique | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 16 | NaN | 4043 | 3 | 105 | NaN | NaN | NaN | NaN | 6936 |
| top | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | UA | NaN | N725MQ | EWR | ORD | NaN | NaN | NaN | NaN | 2013-09-13T12:00:00Z |
| freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 58665 | NaN | 575 | 120835 | 17283 | NaN | NaN | NaN | NaN | 94 |
| mean | 2013.0 | 6.548510 | 15.710787 | 1349.109947 | 1344.254840 | 12.639070 | 1502.054999 | 1536.380220 | 6.895377 | NaN | 1971.923620 | NaN | NaN | NaN | 150.686460 | 1039.912604 | 13.180247 | 26.230100 | NaN |
| std | 0.0 | 3.414457 | 8.768607 | 488.281791 | 467.335756 | 40.210061 | 533.264132 | 497.457142 | 44.633292 | NaN | 1632.471938 | NaN | NaN | NaN | 93.688305 | 733.233033 | 4.661316 | 19.300846 | NaN |
| min | 2013.0 | 1.000000 | 1.000000 | 1.000000 | 106.000000 | -43.000000 | 1.000000 | 1.000000 | -86.000000 | NaN | 1.000000 | NaN | NaN | NaN | 20.000000 | 17.000000 | 1.000000 | 0.000000 | NaN |
| 25% | 2013.0 | 4.000000 | 8.000000 | 907.000000 | 906.000000 | -5.000000 | 1104.000000 | 1124.000000 | -17.000000 | NaN | 553.000000 | NaN | NaN | NaN | 82.000000 | 502.000000 | 9.000000 | 8.000000 | NaN |
| 50% | 2013.0 | 7.000000 | 16.000000 | 1401.000000 | 1359.000000 | -2.000000 | 1535.000000 | 1556.000000 | -5.000000 | NaN | 1496.000000 | NaN | NaN | NaN | 129.000000 | 872.000000 | 13.000000 | 29.000000 | NaN |
| 75% | 2013.0 | 10.000000 | 23.000000 | 1744.000000 | 1729.000000 | 11.000000 | 1940.000000 | 1945.000000 | 14.000000 | NaN | 3465.000000 | NaN | NaN | NaN | 192.000000 | 1389.000000 | 17.000000 | 44.000000 | NaN |
| max | 2013.0 | 12.000000 | 31.000000 | 2400.000000 | 2359.000000 | 1301.000000 | 2400.000000 | 2359.000000 | 1272.000000 | NaN | 8500.000000 | NaN | NaN | NaN | 695.000000 | 4983.000000 | 23.000000 | 59.000000 | NaN |
From the 'flights' dataframe, find all flights that satisfy the following certain conditions:
# Had an arrival delay of two or more hours
flights.loc[flights['arr_delay']>=120]
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 119 | 2013 | 1 | 1 | 811.0 | 630 | 101.0 | 1047.0 | 830 | 137.0 | MQ | 4576 | N531MQ | LGA | CLT | 118.0 | 544 | 6 | 30 | 2013-01-01T11:00:00Z |
| 151 | 2013 | 1 | 1 | 848.0 | 1835 | 853.0 | 1001.0 | 1950 | 851.0 | MQ | 3944 | N942MQ | JFK | BWI | 41.0 | 184 | 18 | 35 | 2013-01-01T23:00:00Z |
| 218 | 2013 | 1 | 1 | 957.0 | 733 | 144.0 | 1056.0 | 853 | 123.0 | UA | 856 | N534UA | EWR | BOS | 37.0 | 200 | 7 | 33 | 2013-01-01T12:00:00Z |
| 268 | 2013 | 1 | 1 | 1114.0 | 900 | 134.0 | 1447.0 | 1222 | 145.0 | UA | 1086 | N76502 | LGA | IAH | 248.0 | 1416 | 9 | 0 | 2013-01-01T14:00:00Z |
| 447 | 2013 | 1 | 1 | 1505.0 | 1310 | 115.0 | 1638.0 | 1431 | 127.0 | EV | 4497 | N17984 | EWR | RIC | 63.0 | 277 | 13 | 10 | 2013-01-01T18:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336579 | 2013 | 9 | 30 | 1823.0 | 1545 | 158.0 | 1934.0 | 1733 | 121.0 | 9E | 3459 | N916XJ | JFK | BNA | 95.0 | 765 | 15 | 45 | 2013-09-30T19:00:00Z |
| 336668 | 2013 | 9 | 30 | 1951.0 | 1649 | 182.0 | 2157.0 | 1903 | 174.0 | EV | 4294 | N13988 | EWR | SAV | 95.0 | 708 | 16 | 49 | 2013-09-30T20:00:00Z |
| 336724 | 2013 | 9 | 30 | 2053.0 | 1815 | 158.0 | 2310.0 | 2054 | 136.0 | EV | 5292 | N600QX | EWR | ATL | 91.0 | 746 | 18 | 15 | 2013-09-30T22:00:00Z |
| 336757 | 2013 | 9 | 30 | 2159.0 | 1845 | 194.0 | 2344.0 | 2030 | 194.0 | 9E | 3320 | N906XJ | JFK | BUF | 50.0 | 301 | 18 | 45 | 2013-09-30T22:00:00Z |
| 336763 | 2013 | 9 | 30 | 2235.0 | 2001 | 154.0 | 59.0 | 2249 | 130.0 | B6 | 1083 | N804JB | JFK | MCO | 123.0 | 944 | 20 | 1 | 2013-10-01T00:00:00Z |
10200 rows × 19 columns
# Flew to Houston (IAH or HOU)
flights[(flights.dest=='IAH') | (flights.dest=='HOU')]
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 32 | 2013 | 1 | 1 | 623.0 | 627 | -4.0 | 933.0 | 932 | 1.0 | UA | 496 | N459UA | LGA | IAH | 229.0 | 1416 | 6 | 27 | 2013-01-01T11:00:00Z |
| 81 | 2013 | 1 | 1 | 728.0 | 732 | -4.0 | 1041.0 | 1038 | 3.0 | UA | 473 | N488UA | LGA | IAH | 238.0 | 1416 | 7 | 32 | 2013-01-01T12:00:00Z |
| 89 | 2013 | 1 | 1 | 739.0 | 739 | 0.0 | 1104.0 | 1038 | 26.0 | UA | 1479 | N37408 | EWR | IAH | 249.0 | 1400 | 7 | 39 | 2013-01-01T12:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336524 | 2013 | 9 | 30 | 1729.0 | 1720 | 9.0 | 2001.0 | 2010 | -9.0 | UA | 652 | N455UA | EWR | IAH | 173.0 | 1400 | 17 | 20 | 2013-09-30T21:00:00Z |
| 336527 | 2013 | 9 | 30 | 1735.0 | 1715 | 20.0 | 2010.0 | 2005 | 5.0 | WN | 2067 | N296WN | EWR | HOU | 188.0 | 1411 | 17 | 15 | 2013-09-30T21:00:00Z |
| 336618 | 2013 | 9 | 30 | 1859.0 | 1859 | 0.0 | 2134.0 | 2159 | -25.0 | UA | 1128 | N14731 | LGA | IAH | 180.0 | 1416 | 18 | 59 | 2013-09-30T22:00:00Z |
| 336694 | 2013 | 9 | 30 | 2015.0 | 2015 | 0.0 | 2244.0 | 2307 | -23.0 | UA | 1545 | N17730 | EWR | IAH | 174.0 | 1400 | 20 | 15 | 2013-10-01T00:00:00Z |
| 336737 | 2013 | 9 | 30 | 2105.0 | 2106 | -1.0 | 2329.0 | 2354 | -25.0 | UA | 475 | N477UA | EWR | IAH | 175.0 | 1400 | 21 | 6 | 2013-10-01T01:00:00Z |
9313 rows × 19 columns
# Were operated by United (UA), American (AA), or Delta (DL)
flights[(flights.carrier=='UA') | (flights.carrier=='AA') | (flights.carrier=='DL')]
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
| 5 | 2013 | 1 | 1 | 554.0 | 558 | -4.0 | 740.0 | 728 | 12.0 | UA | 1696 | N39463 | EWR | ORD | 150.0 | 719 | 5 | 58 | 2013-01-01T10:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336737 | 2013 | 9 | 30 | 2105.0 | 2106 | -1.0 | 2329.0 | 2354 | -25.0 | UA | 475 | N477UA | EWR | IAH | 175.0 | 1400 | 21 | 6 | 2013-10-01T01:00:00Z |
| 336744 | 2013 | 9 | 30 | 2121.0 | 2100 | 21.0 | 2349.0 | 14 | -25.0 | DL | 2363 | N193DN | JFK | LAX | 296.0 | 2475 | 21 | 0 | 2013-10-01T01:00:00Z |
| 336751 | 2013 | 9 | 30 | 2140.0 | 2140 | 0.0 | 10.0 | 40 | -30.0 | AA | 185 | N335AA | JFK | LAX | 298.0 | 2475 | 21 | 40 | 2013-10-01T01:00:00Z |
| 336755 | 2013 | 9 | 30 | 2149.0 | 2156 | -7.0 | 2245.0 | 2308 | -23.0 | UA | 523 | N813UA | EWR | BOS | 37.0 | 200 | 21 | 56 | 2013-10-01T01:00:00Z |
| 336762 | 2013 | 9 | 30 | 2233.0 | 2113 | 80.0 | 112.0 | 30 | 42.0 | UA | 471 | N578UA | EWR | SFO | 318.0 | 2565 | 21 | 13 | 2013-10-01T01:00:00Z |
139504 rows × 19 columns
# Departed in July, August, and September
flights[flights.month.isin([7,8,9])]
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 250450 | 2013 | 7 | 1 | 1.0 | 2029 | 212.0 | 236.0 | 2359 | 157.0 | B6 | 915 | N653JB | JFK | SFO | 315.0 | 2586 | 20 | 29 | 2013-07-02T00:00:00Z |
| 250451 | 2013 | 7 | 1 | 2.0 | 2359 | 3.0 | 344.0 | 344 | 0.0 | B6 | 1503 | N805JB | JFK | SJU | 200.0 | 1598 | 23 | 59 | 2013-07-02T03:00:00Z |
| 250452 | 2013 | 7 | 1 | 29.0 | 2245 | 104.0 | 151.0 | 1 | 110.0 | B6 | 234 | N348JB | JFK | BTV | 66.0 | 266 | 22 | 45 | 2013-07-02T02:00:00Z |
| 250453 | 2013 | 7 | 1 | 43.0 | 2130 | 193.0 | 322.0 | 14 | 188.0 | B6 | 1371 | N794JB | LGA | FLL | 143.0 | 1076 | 21 | 30 | 2013-07-02T01:00:00Z |
| 250454 | 2013 | 7 | 1 | 44.0 | 2150 | 174.0 | 300.0 | 100 | 120.0 | AA | 185 | N324AA | JFK | LAX | 297.0 | 2475 | 21 | 50 | 2013-07-02T01:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | JFK | DCA | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z |
| 336772 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | NaN | LGA | SYR | NaN | 198 | 22 | 0 | 2013-10-01T02:00:00Z |
| 336773 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | LGA | BNA | NaN | 764 | 12 | 10 | 2013-09-30T16:00:00Z |
| 336774 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | LGA | CLE | NaN | 419 | 11 | 59 | 2013-09-30T15:00:00Z |
| 336775 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | LGA | RDU | NaN | 431 | 8 | 40 | 2013-09-30T12:00:00Z |
86326 rows × 19 columns
# Arrived more than two hours late, but didn’t leave late
flights.loc[(flights['arr_delay']>120) & (flights['dep_delay']<0) ]
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22911 | 2013 | 1 | 27 | 1419.0 | 1420 | -1.0 | 1754.0 | 1550 | 124.0 | MQ | 3728 | N1EAMQ | EWR | ORD | 135.0 | 719 | 14 | 20 | 2013-01-27T19:00:00Z |
| 33019 | 2013 | 10 | 7 | 1357.0 | 1359 | -2.0 | 1858.0 | 1654 | 124.0 | AA | 1151 | N3CMAA | LGA | DFW | 192.0 | 1389 | 13 | 59 | 2013-10-07T17:00:00Z |
| 41075 | 2013 | 10 | 16 | 657.0 | 700 | -3.0 | 1258.0 | 1056 | 122.0 | B6 | 3 | N703JB | JFK | SJU | 225.0 | 1598 | 7 | 0 | 2013-10-16T11:00:00Z |
| 55985 | 2013 | 11 | 1 | 658.0 | 700 | -2.0 | 1329.0 | 1015 | 194.0 | VX | 399 | N629VA | JFK | LAX | 336.0 | 2475 | 7 | 0 | 2013-11-01T11:00:00Z |
| 152766 | 2013 | 3 | 18 | 1844.0 | 1847 | -3.0 | 39.0 | 2219 | 140.0 | UA | 389 | N560UA | JFK | SFO | 386.0 | 2586 | 18 | 47 | 2013-03-18T22:00:00Z |
| 180893 | 2013 | 4 | 17 | 1635.0 | 1640 | -5.0 | 2049.0 | 1845 | 124.0 | MQ | 4540 | N721MQ | LGA | DTW | 130.0 | 502 | 16 | 40 | 2013-04-17T20:00:00Z |
| 181270 | 2013 | 4 | 18 | 558.0 | 600 | -2.0 | 1149.0 | 850 | 179.0 | AA | 707 | N3EXAA | LGA | DFW | 234.0 | 1389 | 6 | 0 | 2013-04-18T10:00:00Z |
| 181327 | 2013 | 4 | 18 | 655.0 | 700 | -5.0 | 1213.0 | 950 | 143.0 | AA | 2083 | N565AA | EWR | DFW | 230.0 | 1372 | 7 | 0 | 2013-04-18T11:00:00Z |
| 213693 | 2013 | 5 | 22 | 1827.0 | 1830 | -3.0 | 2217.0 | 2010 | 127.0 | MQ | 4674 | N518MQ | LGA | CLE | 90.0 | 419 | 18 | 30 | 2013-05-22T22:00:00Z |
| 226434 | 2013 | 6 | 5 | 1604.0 | 1615 | -11.0 | 2041.0 | 1840 | 121.0 | MQ | 4657 | N510MQ | LGA | ATL | 158.0 | 762 | 16 | 15 | 2013-06-05T20:00:00Z |
| 235033 | 2013 | 6 | 14 | 1708.0 | 1710 | -2.0 | 2227.0 | 2015 | 132.0 | AA | 181 | N320AA | JFK | LAX | 334.0 | 2475 | 17 | 10 | 2013-06-14T21:00:00Z |
| 244329 | 2013 | 6 | 24 | 1602.0 | 1605 | -3.0 | 2134.0 | 1916 | 138.0 | DL | 706 | N3768 | JFK | AUS | 247.0 | 1521 | 16 | 5 | 2013-06-24T20:00:00Z |
| 247568 | 2013 | 6 | 27 | 2052.0 | 2100 | -8.0 | 13.0 | 2210 | 123.0 | US | 2144 | N952UW | LGA | BOS | 46.0 | 184 | 21 | 0 | 2013-06-28T01:00:00Z |
| 249957 | 2013 | 6 | 30 | 1423.0 | 1425 | -2.0 | 1816.0 | 1554 | 142.0 | B6 | 2402 | N206JB | JFK | BUF | 80.0 | 301 | 14 | 25 | 2013-06-30T18:00:00Z |
| 256340 | 2013 | 7 | 7 | 1659.0 | 1700 | -1.0 | 2050.0 | 1823 | 147.0 | US | 2183 | N948UW | LGA | DCA | 64.0 | 214 | 17 | 0 | 2013-07-07T21:00:00Z |
| 256358 | 2013 | 7 | 7 | 1727.0 | 1730 | -3.0 | 2203.0 | 1951 | 132.0 | F9 | 837 | N263AV | LGA | DEN | 236.0 | 1620 | 17 | 30 | 2013-07-07T21:00:00Z |
| 256373 | 2013 | 7 | 7 | 1746.0 | 1755 | -9.0 | 2133.0 | 1921 | 132.0 | B6 | 1407 | N374JB | JFK | IAD | 78.0 | 228 | 17 | 55 | 2013-07-07T21:00:00Z |
| 256405 | 2013 | 7 | 7 | 1823.0 | 1830 | -7.0 | 2201.0 | 1955 | 126.0 | MQ | 3486 | N724MQ | LGA | BNA | 113.0 | 764 | 18 | 30 | 2013-07-07T22:00:00Z |
| 270742 | 2013 | 7 | 22 | 1555.0 | 1600 | -5.0 | 2139.0 | 1938 | 121.0 | DL | 141 | N713TW | JFK | SFO | 371.0 | 2586 | 16 | 0 | 2013-07-22T20:00:00Z |
| 270752 | 2013 | 7 | 22 | 1606.0 | 1615 | -9.0 | 2056.0 | 1831 | 145.0 | DL | 1619 | N970DL | LGA | MSP | 140.0 | 1020 | 16 | 15 | 2013-07-22T20:00:00Z |
| 270762 | 2013 | 7 | 22 | 1628.0 | 1630 | -2.0 | 2151.0 | 1939 | 132.0 | B6 | 423 | N625JB | JFK | LAX | 332.0 | 2475 | 16 | 30 | 2013-07-22T20:00:00Z |
| 276564 | 2013 | 7 | 28 | 1710.0 | 1711 | -1.0 | 2248.0 | 2039 | 129.0 | B6 | 167 | N510JB | JFK | OAK | 353.0 | 2576 | 17 | 11 | 2013-07-28T21:00:00Z |
| 287148 | 2013 | 8 | 8 | 1457.0 | 1500 | -3.0 | 1828.0 | 1624 | 124.0 | US | 2185 | N746UW | LGA | DCA | 70.0 | 214 | 15 | 0 | 2013-08-08T19:00:00Z |
| 291439 | 2013 | 8 | 13 | 657.0 | 659 | -2.0 | 1015.0 | 814 | 121.0 | EV | 4522 | N14188 | EWR | BNA | 146.0 | 748 | 6 | 59 | 2013-08-13T10:00:00Z |
| 305985 | 2013 | 8 | 28 | 1157.0 | 1200 | -3.0 | 1520.0 | 1316 | 124.0 | US | 2179 | N737US | LGA | DCA | 63.0 | 214 | 12 | 0 | 2013-08-28T16:00:00Z |
| 325764 | 2013 | 9 | 19 | 656.0 | 700 | -4.0 | 1037.0 | 833 | 124.0 | UA | 331 | N808UA | LGA | ORD | 192.0 | 733 | 7 | 0 | 2013-09-19T11:00:00Z |
# Were delayed by at least an hour, but made up over 30 minutes in flight
flights[(flights.dep_delay<60) & (flights.arr_delay<60) & (flights.air_time>30)]
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336765 | 2013 | 9 | 30 | 2240.0 | 2245 | -5.0 | 2334.0 | 2351 | -17.0 | B6 | 1816 | N354JB | JFK | SYR | 41.0 | 209 | 22 | 45 | 2013-10-01T02:00:00Z |
| 336766 | 2013 | 9 | 30 | 2240.0 | 2250 | -10.0 | 2347.0 | 7 | -20.0 | B6 | 2002 | N281JB | JFK | BUF | 52.0 | 301 | 22 | 50 | 2013-10-01T02:00:00Z |
| 336767 | 2013 | 9 | 30 | 2241.0 | 2246 | -5.0 | 2345.0 | 1 | -16.0 | B6 | 486 | N346JB | JFK | ROC | 47.0 | 264 | 22 | 46 | 2013-10-01T02:00:00Z |
| 336768 | 2013 | 9 | 30 | 2307.0 | 2255 | 12.0 | 2359.0 | 2358 | 1.0 | B6 | 718 | N565JB | JFK | BOS | 33.0 | 187 | 22 | 55 | 2013-10-01T02:00:00Z |
| 336769 | 2013 | 9 | 30 | 2349.0 | 2359 | -10.0 | 325.0 | 350 | -25.0 | B6 | 745 | N516JB | JFK | PSE | 196.0 | 1617 | 23 | 59 | 2013-10-01T03:00:00Z |
294121 rows × 19 columns
# Departed between midnight and 6am (inclusive)
flights[(flights.dep_time>=0) & (flights.dep_time<=600)]
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335802 | 2013 | 9 | 30 | 557.0 | 600 | -3.0 | 852.0 | 923 | -31.0 | UA | 303 | N510UA | JFK | SFO | 326.0 | 2586 | 6 | 0 | 2013-09-30T10:00:00Z |
| 335803 | 2013 | 9 | 30 | 558.0 | 600 | -2.0 | 815.0 | 829 | -14.0 | EV | 4137 | N16981 | EWR | ATL | 107.0 | 746 | 6 | 0 | 2013-09-30T10:00:00Z |
| 335804 | 2013 | 9 | 30 | 558.0 | 600 | -2.0 | 742.0 | 749 | -7.0 | DL | 731 | N337NB | LGA | DTW | 83.0 | 502 | 6 | 0 | 2013-09-30T10:00:00Z |
| 335805 | 2013 | 9 | 30 | 559.0 | 600 | -1.0 | NaN | 715 | NaN | WN | 464 | N411WN | EWR | MDW | NaN | 711 | 6 | 0 | 2013-09-30T10:00:00Z |
| 335806 | 2013 | 9 | 30 | 600.0 | 600 | 0.0 | 844.0 | 856 | -12.0 | B6 | 601 | N588JB | JFK | FLL | 139.0 | 1069 | 6 | 0 | 2013-09-30T10:00:00Z |
9344 rows × 19 columns
# How many flights have a missing dep_time?
flights.dep_time.isnull().sum()
8255
# Sort flights to find the least delayed flights.
flights.sort_values(['dep_delay'])
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 89673 | 2013 | 12 | 7 | 2040.0 | 2123 | -43.0 | 40.0 | 2352 | 48.0 | B6 | 97 | N592JB | JFK | DEN | 265.0 | 1626 | 21 | 23 | 2013-12-08T02:00:00Z |
| 113633 | 2013 | 2 | 3 | 2022.0 | 2055 | -33.0 | 2240.0 | 2338 | -58.0 | DL | 1715 | N612DL | LGA | MSY | 162.0 | 1183 | 20 | 55 | 2013-02-04T01:00:00Z |
| 64501 | 2013 | 11 | 10 | 1408.0 | 1440 | -32.0 | 1549.0 | 1559 | -10.0 | EV | 5713 | N825AS | LGA | IAD | 52.0 | 229 | 14 | 40 | 2013-11-10T19:00:00Z |
| 9619 | 2013 | 1 | 11 | 1900.0 | 1930 | -30.0 | 2233.0 | 2243 | -10.0 | DL | 1435 | N934DL | LGA | TPA | 139.0 | 1010 | 19 | 30 | 2013-01-12T00:00:00Z |
| 24915 | 2013 | 1 | 29 | 1703.0 | 1730 | -27.0 | 1947.0 | 1957 | -10.0 | F9 | 837 | N208FR | LGA | DEN | 250.0 | 1620 | 17 | 30 | 2013-01-29T22:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | JFK | DCA | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z |
| 336772 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | NaN | LGA | SYR | NaN | 198 | 22 | 0 | 2013-10-01T02:00:00Z |
| 336773 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | LGA | BNA | NaN | 764 | 12 | 10 | 2013-09-30T16:00:00Z |
| 336774 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | LGA | CLE | NaN | 419 | 11 | 59 | 2013-09-30T15:00:00Z |
| 336775 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | LGA | RDU | NaN | 431 | 8 | 40 | 2013-09-30T12:00:00Z |
336776 rows × 19 columns
#Find the flights that left earliest.
flights.sort_values(['dep_time'])
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 250450 | 2013 | 7 | 1 | 1.0 | 2029 | 212.0 | 236.0 | 2359 | 157.0 | B6 | 915 | N653JB | JFK | SFO | 315.0 | 2586 | 20 | 29 | 2013-07-02T00:00:00Z |
| 109552 | 2013 | 12 | 30 | 1.0 | 2359 | 2.0 | 441.0 | 437 | 4.0 | B6 | 839 | N508JB | JFK | BQN | 198.0 | 1576 | 23 | 59 | 2013-12-31T04:00:00Z |
| 240026 | 2013 | 6 | 20 | 1.0 | 2359 | 2.0 | 340.0 | 350 | -10.0 | B6 | 745 | N517JB | JFK | PSE | 196.0 | 1617 | 23 | 59 | 2013-06-21T03:00:00Z |
| 212954 | 2013 | 5 | 22 | 1.0 | 1935 | 266.0 | 154.0 | 2140 | 254.0 | EV | 4361 | N27200 | EWR | TYS | 94.0 | 631 | 19 | 35 | 2013-05-22T23:00:00Z |
| 215892 | 2013 | 5 | 25 | 1.0 | 2359 | 2.0 | 336.0 | 341 | -5.0 | B6 | 727 | N523JB | JFK | BQN | 189.0 | 1576 | 23 | 59 | 2013-05-26T03:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | JFK | DCA | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z |
| 336772 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | NaN | LGA | SYR | NaN | 198 | 22 | 0 | 2013-10-01T02:00:00Z |
| 336773 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | LGA | BNA | NaN | 764 | 12 | 10 | 2013-09-30T16:00:00Z |
| 336774 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | LGA | CLE | NaN | 419 | 11 | 59 | 2013-09-30T15:00:00Z |
| 336775 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | LGA | RDU | NaN | 431 | 8 | 40 | 2013-09-30T12:00:00Z |
336776 rows × 19 columns
# Which flights travelled the farthest? Which travelled the shortest?
flights.sort_values(['distance'], ascending=[True])
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 275945 | 2013 | 7 | 27 | NaN | 106 | NaN | NaN | 245 | NaN | US | 1632 | NaN | EWR | LGA | NaN | 17 | 1 | 6 | 2013-07-27T05:00:00Z |
| 3083 | 2013 | 1 | 4 | 1240.0 | 1200 | 40.0 | 1333.0 | 1306 | 27.0 | EV | 4193 | N14972 | EWR | PHL | 30.0 | 80 | 12 | 0 | 2013-01-04T17:00:00Z |
| 16328 | 2013 | 1 | 19 | 1617.0 | 1617 | 0.0 | 1722.0 | 1722 | 0.0 | EV | 4616 | N12540 | EWR | PHL | 34.0 | 80 | 16 | 17 | 2013-01-19T21:00:00Z |
| 112178 | 2013 | 2 | 1 | 2128.0 | 2129 | -1.0 | 2216.0 | 2224 | -8.0 | EV | 4619 | N13969 | EWR | PHL | 24.0 | 80 | 21 | 29 | 2013-02-02T02:00:00Z |
| 19983 | 2013 | 1 | 23 | 2128.0 | 2129 | -1.0 | 2221.0 | 2224 | -3.0 | EV | 4619 | N12135 | EWR | PHL | 23.0 | 80 | 21 | 29 | 2013-01-24T02:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99112 | 2013 | 12 | 18 | 928.0 | 930 | -2.0 | 1543.0 | 1535 | 8.0 | HA | 51 | N395HA | JFK | HNL | 641.0 | 4983 | 9 | 30 | 2013-12-18T14:00:00Z |
| 223207 | 2013 | 6 | 2 | 956.0 | 1000 | -4.0 | 1442.0 | 1435 | 7.0 | HA | 51 | N383HA | JFK | HNL | 617.0 | 4983 | 10 | 0 | 2013-06-02T14:00:00Z |
| 151311 | 2013 | 3 | 17 | 1006.0 | 1000 | 6.0 | 1607.0 | 1530 | 37.0 | HA | 51 | N380HA | JFK | HNL | 686.0 | 4983 | 10 | 0 | 2013-03-17T14:00:00Z |
| 218562 | 2013 | 5 | 28 | 953.0 | 1000 | -7.0 | 1447.0 | 1500 | -13.0 | HA | 51 | N385HA | JFK | HNL | 631.0 | 4983 | 10 | 0 | 2013-05-28T14:00:00Z |
| 289650 | 2013 | 8 | 11 | 950.0 | 1000 | -10.0 | 1438.0 | 1440 | -2.0 | HA | 51 | N391HA | JFK | HNL | 628.0 | 4983 | 10 | 0 | 2013-08-11T14:00:00Z |
336776 rows × 19 columns
Use at least three ways to select dep_time, dep_delay, arr_time, and arr_delay from flights.
# Method 1
flights[['dep_time','dep_delay','arr_time','arr_delay']]
| dep_time | dep_delay | arr_time | arr_delay | |
|---|---|---|---|---|
| 0 | 517.0 | 2.0 | 830.0 | 11.0 |
| 1 | 533.0 | 4.0 | 850.0 | 20.0 |
| 2 | 542.0 | 2.0 | 923.0 | 33.0 |
| 3 | 544.0 | -1.0 | 1004.0 | -18.0 |
| 4 | 554.0 | -6.0 | 812.0 | -25.0 |
| ... | ... | ... | ... | ... |
| 336771 | NaN | NaN | NaN | NaN |
| 336772 | NaN | NaN | NaN | NaN |
| 336773 | NaN | NaN | NaN | NaN |
| 336774 | NaN | NaN | NaN | NaN |
| 336775 | NaN | NaN | NaN | NaN |
336776 rows × 4 columns
# Method 2
flights.iloc[:,[3,5,6,8]]
| dep_time | dep_delay | arr_time | arr_delay | |
|---|---|---|---|---|
| 0 | 517.0 | 2.0 | 830.0 | 11.0 |
| 1 | 533.0 | 4.0 | 850.0 | 20.0 |
| 2 | 542.0 | 2.0 | 923.0 | 33.0 |
| 3 | 544.0 | -1.0 | 1004.0 | -18.0 |
| 4 | 554.0 | -6.0 | 812.0 | -25.0 |
| ... | ... | ... | ... | ... |
| 336771 | NaN | NaN | NaN | NaN |
| 336772 | NaN | NaN | NaN | NaN |
| 336773 | NaN | NaN | NaN | NaN |
| 336774 | NaN | NaN | NaN | NaN |
| 336775 | NaN | NaN | NaN | NaN |
336776 rows × 4 columns
# Method 3
flights.loc[:,('dep_time','dep_delay','arr_time','arr_delay')]
| dep_time | dep_delay | arr_time | arr_delay | |
|---|---|---|---|---|
| 0 | 517.0 | 2.0 | 830.0 | 11.0 |
| 1 | 533.0 | 4.0 | 850.0 | 20.0 |
| 2 | 542.0 | 2.0 | 923.0 | 33.0 |
| 3 | 544.0 | -1.0 | 1004.0 | -18.0 |
| 4 | 554.0 | -6.0 | 812.0 | -25.0 |
| ... | ... | ... | ... | ... |
| 336771 | NaN | NaN | NaN | NaN |
| 336772 | NaN | NaN | NaN | NaN |
| 336773 | NaN | NaN | NaN | NaN |
| 336774 | NaN | NaN | NaN | NaN |
| 336775 | NaN | NaN | NaN | NaN |
336776 rows × 4 columns
Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers.
For example, 759 means 7:59 and 801 means 8:01. Their difference is not 42 but 2 minutes.
# Convert them to a more convenient representation of number of minutes since midnight (0).
flights_sml = flights.filter(['dep_time','arr_time','sched_dep_time','air_time','dep_delay'])
flights_sml['dhour'] = flights_sml.dep_time//100
flights_sml['dminute'] = flights_sml.dep_time%100
flights_sml['dep_time'] = flights_sml.dhour*60 + flights_sml.dminute
flights_sml['ahour'] = flights_sml.arr_time//100
flights_sml['aminute'] = flights_sml.arr_time%100
flights_sml['arr_time'] = flights_sml.ahour*60 + flights_sml.aminute
flights_sml['shour'] = flights_sml.sched_dep_time//100
flights_sml['sminute'] = flights_sml.sched_dep_time%100
flights_sml['sched_dep_time'] = flights_sml.shour*60 + flights_sml.sminute
flights_sml
| dep_time | arr_time | sched_dep_time | air_time | dep_delay | dhour | dminute | ahour | aminute | shour | sminute | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 317.0 | 510.0 | 315 | 227.0 | 2.0 | 5.0 | 17.0 | 8.0 | 30.0 | 5 | 15 |
| 1 | 333.0 | 530.0 | 329 | 227.0 | 4.0 | 5.0 | 33.0 | 8.0 | 50.0 | 5 | 29 |
| 2 | 342.0 | 563.0 | 340 | 160.0 | 2.0 | 5.0 | 42.0 | 9.0 | 23.0 | 5 | 40 |
| 3 | 344.0 | 604.0 | 345 | 183.0 | -1.0 | 5.0 | 44.0 | 10.0 | 4.0 | 5 | 45 |
| 4 | 354.0 | 492.0 | 360 | 116.0 | -6.0 | 5.0 | 54.0 | 8.0 | 12.0 | 6 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | NaN | NaN | 895 | NaN | NaN | NaN | NaN | NaN | NaN | 14 | 55 |
| 336772 | NaN | NaN | 1320 | NaN | NaN | NaN | NaN | NaN | NaN | 22 | 0 |
| 336773 | NaN | NaN | 730 | NaN | NaN | NaN | NaN | NaN | NaN | 12 | 10 |
| 336774 | NaN | NaN | 719 | NaN | NaN | NaN | NaN | NaN | NaN | 11 | 59 |
| 336775 | NaN | NaN | 520 | NaN | NaN | NaN | NaN | NaN | NaN | 8 | 40 |
336776 rows × 11 columns
# Create a new column of arr_time - dep_time.
flights_sml['airtime']=flights_sml.arr_time-flights_sml.dep_time
# Compare this column with air_time.
flights_sml_air=flights_sml.filter(['airtime','air_time'])
flights_sml_air
| airtime | air_time | |
|---|---|---|
| 0 | 193.0 | 227.0 |
| 1 | 197.0 | 227.0 |
| 2 | 221.0 | 160.0 |
| 3 | 260.0 | 183.0 |
| 4 | 138.0 | 116.0 |
| ... | ... | ... |
| 336771 | NaN | NaN |
| 336772 | NaN | NaN |
| 336773 | NaN | NaN |
| 336774 | NaN | NaN |
| 336775 | NaN | NaN |
336776 rows × 2 columns
# Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
# Try creating a column to calculate dep_delay from dep_time and sched_dep_time (and/or other columns if necessary).
flights_sml_dep=flights_sml.filter(['dep_time','sched_dep_time','dep_delay'])
flights_sml_dep['depdelay']=flights_sml.dep_time-flights_sml.sched_dep_time
# Test your results.
flights_sml_dep
| dep_time | sched_dep_time | dep_delay | depdelay | |
|---|---|---|---|---|
| 0 | 317.0 | 315 | 2.0 | 2.0 |
| 1 | 333.0 | 329 | 4.0 | 4.0 |
| 2 | 342.0 | 340 | 2.0 | 2.0 |
| 3 | 344.0 | 345 | -1.0 | -1.0 |
| 4 | 354.0 | 360 | -6.0 | -6.0 |
| ... | ... | ... | ... | ... |
| 336771 | NaN | 895 | NaN | NaN |
| 336772 | NaN | 1320 | NaN | NaN |
| 336773 | NaN | 730 | NaN | NaN |
| 336774 | NaN | 719 | NaN | NaN |
| 336775 | NaN | 520 | NaN | NaN |
336776 rows × 4 columns
The following questions may require multiple operations above.
# Find the 20 most delayed flights.
# Display the following: year,month,day,carrier,flight,dep_delay,arr_delay,carrier
# How do you want to handle ties?
(
flights
.head(20)
.sort_values(['dep_delay' or 'arr_delay'], ascending=[False])
.filter(['year','month','day','carrier','flight','dep_delay','arr_delay'])
)
| year | month | day | carrier | flight | dep_delay | arr_delay | |
|---|---|---|---|---|---|---|---|
| 1 | 2013 | 1 | 1 | UA | 1714 | 4.0 | 20.0 |
| 0 | 2013 | 1 | 1 | UA | 1545 | 2.0 | 11.0 |
| 2 | 2013 | 1 | 1 | AA | 1141 | 2.0 | 33.0 |
| 19 | 2013 | 1 | 1 | B6 | 343 | 1.0 | -6.0 |
| 15 | 2013 | 1 | 1 | B6 | 1806 | 0.0 | -4.0 |
| 17 | 2013 | 1 | 1 | B6 | 371 | 0.0 | -7.0 |
| 18 | 2013 | 1 | 1 | MQ | 4650 | 0.0 | 12.0 |
| 3 | 2013 | 1 | 1 | B6 | 725 | -1.0 | -18.0 |
| 14 | 2013 | 1 | 1 | AA | 707 | -1.0 | 31.0 |
| 16 | 2013 | 1 | 1 | UA | 1187 | -1.0 | -8.0 |
| 9 | 2013 | 1 | 1 | AA | 301 | -2.0 | 8.0 |
| 11 | 2013 | 1 | 1 | B6 | 71 | -2.0 | -3.0 |
| 12 | 2013 | 1 | 1 | UA | 194 | -2.0 | 7.0 |
| 13 | 2013 | 1 | 1 | UA | 1124 | -2.0 | -14.0 |
| 10 | 2013 | 1 | 1 | B6 | 49 | -2.0 | -2.0 |
| 7 | 2013 | 1 | 1 | EV | 5708 | -3.0 | -14.0 |
| 8 | 2013 | 1 | 1 | B6 | 79 | -3.0 | -8.0 |
| 5 | 2013 | 1 | 1 | UA | 1696 | -4.0 | 12.0 |
| 6 | 2013 | 1 | 1 | B6 | 507 | -5.0 | 19.0 |
| 4 | 2013 | 1 | 1 | DL | 461 | -6.0 | -25.0 |
# Sort all AA flights to find the top 10 fastest (highest speed) flights.
# Display the following: year,month,day,carrier,flight,orig,dest,distance,air_time,speed (miles per hour)
flights_ml=flights.filter(['year','month','day','carrier','flight','orig','dest','distance','air_time'])
flights_ml['speed']=flights_ml.distance/flights_ml.air_time*60
flights_ml.head(10).sort_values(['speed'],ascending=[False])
| year | month | day | carrier | flight | dest | distance | air_time | speed | |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 2013 | 1 | 1 | B6 | 725 | BQN | 1576 | 183.0 | 516.721311 |
| 2 | 2013 | 1 | 1 | AA | 1141 | MIA | 1089 | 160.0 | 408.375000 |
| 8 | 2013 | 1 | 1 | B6 | 79 | MCO | 944 | 140.0 | 404.571429 |
| 6 | 2013 | 1 | 1 | B6 | 507 | FLL | 1065 | 158.0 | 404.430380 |
| 4 | 2013 | 1 | 1 | DL | 461 | ATL | 762 | 116.0 | 394.137931 |
| 1 | 2013 | 1 | 1 | UA | 1714 | IAH | 1416 | 227.0 | 374.273128 |
| 0 | 2013 | 1 | 1 | UA | 1545 | IAH | 1400 | 227.0 | 370.044053 |
| 9 | 2013 | 1 | 1 | AA | 301 | ORD | 733 | 138.0 | 318.695652 |
| 5 | 2013 | 1 | 1 | UA | 1696 | ORD | 719 | 150.0 | 287.600000 |
| 7 | 2013 | 1 | 1 | EV | 5708 | IAD | 229 | 53.0 | 259.245283 |
# Find all flights that satisfy the following:
# - From John F. Kennedy Airpot (JFK) or Newark Aiport (EWR) to Seattle-Tacoma Airport (SEA)
# - Carrier is UA, AA, or DL.
# - Dates from 4/1/2013 (inclusive) to 4/3/2013 (inclusive)
# - Scheduled arrival time is before noon.
# - Display the following: year,month,day,carrier,flight,origin,dest,sched_dep_time,sched_arr_time
# - Sort by year, month, day, sched_arr_time
(
flights
[((flights.origin=='JFK')|(flights.origin=='EWR')) #flights.origin.isin(['JFK','EWR'])
&(flights.dest=='SEA')
&((flights.carrier=='UA')|(flights.carrier=='AA')|(flights.carrier=='DL'))
&((flights.year==2013)&(flights.month==4)&(flights.day>=1)&(flights.day<=3))
&(flights.sched_arr_time<1200)]
.filter(['year','month','day','carrier','flight','origin','dest','sched_dep_time','sched_arr_time'])
.sort_values(['year','month','day', 'sched_arr_time'])
)
| year | month | day | carrier | flight | origin | dest | sched_dep_time | sched_arr_time | |
|---|---|---|---|---|---|---|---|---|---|
| 165210 | 2013 | 4 | 1 | DL | 183 | JFK | SEA | 745 | 1100 |
| 166180 | 2013 | 4 | 2 | DL | 183 | JFK | SEA | 745 | 1100 |
| 167168 | 2013 | 4 | 3 | DL | 183 | JFK | SEA | 745 | 1100 |
You have learned several key operations that allow you to solve the vast majority of your data manipulation challenges:
These can all be used in conjunction with groupby() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group.
import pandas as pd
import numpy as np
from nycflights13 import flights
flights.head()
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
Date of departure
Actual departure and arrival times (format HHMM or HMM), local tz.
Scheduled departure and arrival times (format HHMM or HMM), local tz.
Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
Time of scheduled departure broken into hour and minutes.
Two letter carrier abbreviation. See airlines() to get name
Plane tail number
Flight number
Origin and destination. See airports() for additional metadata.
Amount of time spent in the air, in minutes
Distance between airports, in miles
Scheduled date and hour of the flight as a date. Along with origin, can be used to join flights data to weather data.
# Basic descriptive statistics for each column
flights.describe()
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | flight | air_time | distance | hour | minute | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 336776.0 | 336776.000000 | 336776.000000 | 328521.000000 | 336776.000000 | 328521.000000 | 328063.000000 | 336776.000000 | 327346.000000 | 336776.000000 | 327346.000000 | 336776.000000 | 336776.000000 | 336776.000000 |
| mean | 2013.0 | 6.548510 | 15.710787 | 1349.109947 | 1344.254840 | 12.639070 | 1502.054999 | 1536.380220 | 6.895377 | 1971.923620 | 150.686460 | 1039.912604 | 13.180247 | 26.230100 |
| std | 0.0 | 3.414457 | 8.768607 | 488.281791 | 467.335756 | 40.210061 | 533.264132 | 497.457142 | 44.633292 | 1632.471938 | 93.688305 | 733.233033 | 4.661316 | 19.300846 |
| min | 2013.0 | 1.000000 | 1.000000 | 1.000000 | 106.000000 | -43.000000 | 1.000000 | 1.000000 | -86.000000 | 1.000000 | 20.000000 | 17.000000 | 1.000000 | 0.000000 |
| 25% | 2013.0 | 4.000000 | 8.000000 | 907.000000 | 906.000000 | -5.000000 | 1104.000000 | 1124.000000 | -17.000000 | 553.000000 | 82.000000 | 502.000000 | 9.000000 | 8.000000 |
| 50% | 2013.0 | 7.000000 | 16.000000 | 1401.000000 | 1359.000000 | -2.000000 | 1535.000000 | 1556.000000 | -5.000000 | 1496.000000 | 129.000000 | 872.000000 | 13.000000 | 29.000000 |
| 75% | 2013.0 | 10.000000 | 23.000000 | 1744.000000 | 1729.000000 | 11.000000 | 1940.000000 | 1945.000000 | 14.000000 | 3465.000000 | 192.000000 | 1389.000000 | 17.000000 | 44.000000 |
| max | 2013.0 | 12.000000 | 31.000000 | 2400.000000 | 2359.000000 | 1301.000000 | 2400.000000 | 2359.000000 | 1272.000000 | 8500.000000 | 695.000000 | 4983.000000 | 23.000000 | 59.000000 |
flights.describe(include='all')
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 336776.0 | 336776.000000 | 336776.000000 | 328521.000000 | 336776.000000 | 328521.000000 | 328063.000000 | 336776.000000 | 327346.000000 | 336776 | 336776.000000 | 334264 | 336776 | 336776 | 327346.000000 | 336776.000000 | 336776.000000 | 336776.000000 | 336776 |
| unique | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 16 | NaN | 4043 | 3 | 105 | NaN | NaN | NaN | NaN | 6936 |
| top | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | UA | NaN | N725MQ | EWR | ORD | NaN | NaN | NaN | NaN | 2013-09-13T12:00:00Z |
| freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 58665 | NaN | 575 | 120835 | 17283 | NaN | NaN | NaN | NaN | 94 |
| mean | 2013.0 | 6.548510 | 15.710787 | 1349.109947 | 1344.254840 | 12.639070 | 1502.054999 | 1536.380220 | 6.895377 | NaN | 1971.923620 | NaN | NaN | NaN | 150.686460 | 1039.912604 | 13.180247 | 26.230100 | NaN |
| std | 0.0 | 3.414457 | 8.768607 | 488.281791 | 467.335756 | 40.210061 | 533.264132 | 497.457142 | 44.633292 | NaN | 1632.471938 | NaN | NaN | NaN | 93.688305 | 733.233033 | 4.661316 | 19.300846 | NaN |
| min | 2013.0 | 1.000000 | 1.000000 | 1.000000 | 106.000000 | -43.000000 | 1.000000 | 1.000000 | -86.000000 | NaN | 1.000000 | NaN | NaN | NaN | 20.000000 | 17.000000 | 1.000000 | 0.000000 | NaN |
| 25% | 2013.0 | 4.000000 | 8.000000 | 907.000000 | 906.000000 | -5.000000 | 1104.000000 | 1124.000000 | -17.000000 | NaN | 553.000000 | NaN | NaN | NaN | 82.000000 | 502.000000 | 9.000000 | 8.000000 | NaN |
| 50% | 2013.0 | 7.000000 | 16.000000 | 1401.000000 | 1359.000000 | -2.000000 | 1535.000000 | 1556.000000 | -5.000000 | NaN | 1496.000000 | NaN | NaN | NaN | 129.000000 | 872.000000 | 13.000000 | 29.000000 | NaN |
| 75% | 2013.0 | 10.000000 | 23.000000 | 1744.000000 | 1729.000000 | 11.000000 | 1940.000000 | 1945.000000 | 14.000000 | NaN | 3465.000000 | NaN | NaN | NaN | 192.000000 | 1389.000000 | 17.000000 | 44.000000 | NaN |
| max | 2013.0 | 12.000000 | 31.000000 | 2400.000000 | 2359.000000 | 1301.000000 | 2400.000000 | 2359.000000 | 1272.000000 | NaN | 8500.000000 | NaN | NaN | NaN | 695.000000 | 4983.000000 | 23.000000 | 59.000000 | NaN |
# Dimensions of the dataframe
flights.shape
(336776, 19)
# Number of rows in the dataframe
len(flights)
336776
# Number of distinct values in a column.
flights['carrier'].nunique()
16
# Count number of rows with each unique value of variable
flights['carrier'].value_counts()
UA 58665 B6 54635 EV 54173 DL 48110 AA 32729 MQ 26397 US 20536 9E 18460 WN 12275 VX 5162 FL 3260 AS 714 F9 685 YV 601 HA 342 OO 32 Name: carrier, dtype: int64
Pandas provides a large set of summary functions that operate on different kinds of pandas objects (DataFrame columns, Series, GroupBy, and produce single values for each of the groups. When applied to a DataFrame, the result is returned as a pandas Series for each column. Examples:
sum() Sum values of each object.count() Count non-NA/null values of each object.median() Median value of each object.quantile([0.25,0.75]) Quantiles of each object.min() Minimum value in each object.max() Maximum value in each object.mean() Mean value of each object.var() Variance of each object.std() Standard deviation of each object.apply(function) Apply function to each object.These summary functions can be applied to all the rows in the dataframe.
flights.count()
year 336776 month 336776 day 336776 dep_time 328521 sched_dep_time 336776 dep_delay 328521 arr_time 328063 sched_arr_time 336776 arr_delay 327346 carrier 336776 flight 336776 tailnum 334264 origin 336776 dest 336776 air_time 327346 distance 336776 hour 336776 minute 336776 time_hour 336776 dtype: int64
# Count the number of flights (rows)
flights['flight'].count()
336776
# Sum up the total distance of all flights.
flights['distance'].sum()
350217607
flights.distance.sum()
350217607
# Average/mean of arrival delay
flights['arr_delay'].mean()
6.89537675731489
# Apply a function to multiple columns
flights[['distance','air_time']].max()
distance 4983.0 air_time 695.0 dtype: float64
flights.max()
C:\Users\sotni\AppData\Local\Temp/ipykernel_2284/2580502188.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction. flights.max()
year 2013 month 12 day 31 dep_time 2400.0 sched_dep_time 2359 dep_delay 1301.0 arr_time 2400.0 sched_arr_time 2359 arr_delay 1272.0 carrier YV flight 8500 origin LGA dest XNA air_time 695.0 distance 4983 hour 23 minute 59 time_hour 2014-01-01T04:00:00Z dtype: object
# Average departure delay of all flights from JFK to SEA
(
flights
[flights.origin.isin(['JFK']) & (flights.dest=='SEA')]
.dep_delay
.mean()
)
8.8875
These summary functions are not terribly useful unless we pair them with groupby(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use a summary function on a grouped data frame they’ll be automatically applied “by group”.
flights.groupby('carrier').size()
carrier 9E 18460 AA 32729 AS 714 B6 54635 DL 48110 EV 54173 F9 685 FL 3260 HA 342 MQ 26397 OO 32 UA 58665 US 20536 VX 5162 WN 12275 YV 601 dtype: int64
flights.carrier.value_counts()
UA 58665 B6 54635 EV 54173 DL 48110 AA 32729 MQ 26397 US 20536 9E 18460 WN 12275 VX 5162 FL 3260 AS 714 F9 685 YV 601 HA 342 OO 32 Name: carrier, dtype: int64
flights.groupby(['carrier','flight']).size()
carrier flight
9E 2900 59
2901 55
2902 55
2903 56
2904 57
..
YV 3778 3
3788 23
3790 9
3791 15
3799 1
Length: 5725, dtype: int64
(flights
.groupby(['year','month','day'])
['arr_delay']
.mean())
year month day
2013 1 1 12.651023
2 12.692888
3 5.733333
4 -1.932819
5 -1.525802
...
12 27 -0.148803
28 -3.259533
29 18.763825
30 10.057712
31 6.212121
Name: arr_delay, Length: 365, dtype: float64
flights.groupby(['year','month'])['arr_delay'].agg(['mean','std','min','max'])
| mean | std | min | max | ||
|---|---|---|---|---|---|
| year | month | ||||
| 2013 | 1 | 6.129972 | 40.423898 | -70.0 | 1272.0 |
| 2 | 5.613019 | 39.528619 | -70.0 | 834.0 | |
| 3 | 5.807577 | 44.119192 | -68.0 | 915.0 | |
| 4 | 11.176063 | 47.491151 | -68.0 | 931.0 | |
| 5 | 3.521509 | 44.237613 | -86.0 | 875.0 | |
| 6 | 16.481330 | 56.130866 | -64.0 | 1127.0 | |
| 7 | 16.711307 | 57.117088 | -66.0 | 989.0 | |
| 8 | 6.040652 | 42.595142 | -68.0 | 490.0 | |
| 9 | -4.018364 | 39.710309 | -68.0 | 1007.0 | |
| 10 | -0.167063 | 32.649858 | -61.0 | 688.0 | |
| 11 | 0.461347 | 31.387406 | -67.0 | 796.0 | |
| 12 | 14.870355 | 46.133110 | -68.0 | 878.0 |
flights.groupby(['year','month'])['arr_delay'].agg(['mean','std','min','max']).reset_index()
| year | month | mean | std | min | max | |
|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 6.129972 | 40.423898 | -70.0 | 1272.0 |
| 1 | 2013 | 2 | 5.613019 | 39.528619 | -70.0 | 834.0 |
| 2 | 2013 | 3 | 5.807577 | 44.119192 | -68.0 | 915.0 |
| 3 | 2013 | 4 | 11.176063 | 47.491151 | -68.0 | 931.0 |
| 4 | 2013 | 5 | 3.521509 | 44.237613 | -86.0 | 875.0 |
| 5 | 2013 | 6 | 16.481330 | 56.130866 | -64.0 | 1127.0 |
| 6 | 2013 | 7 | 16.711307 | 57.117088 | -66.0 | 989.0 |
| 7 | 2013 | 8 | 6.040652 | 42.595142 | -68.0 | 490.0 |
| 8 | 2013 | 9 | -4.018364 | 39.710309 | -68.0 | 1007.0 |
| 9 | 2013 | 10 | -0.167063 | 32.649858 | -61.0 | 688.0 |
| 10 | 2013 | 11 | 0.461347 | 31.387406 | -67.0 | 796.0 |
| 11 | 2013 | 12 | 14.870355 | 46.133110 | -68.0 | 878.0 |
What if we want to apply different summary functions to different columns?
flights.groupby('carrier').agg({'flight': 'size',
'distance': 'sum',
'arr_delay': ['mean','std'],
'hour': lambda x: x.max()-x.min()
})
| flight | distance | arr_delay | hour | ||
|---|---|---|---|---|---|
| size | sum | mean | std | <lambda> | |
| carrier | |||||
| 9E | 18460 | 9788152 | 7.379669 | 50.086778 | 16 |
| AA | 32729 | 43864584 | 0.364291 | 42.516182 | 16 |
| AS | 714 | 1715028 | -9.930889 | 36.482633 | 11 |
| B6 | 54635 | 58384137 | 9.457973 | 42.842297 | 18 |
| DL | 48110 | 59507317 | 1.644341 | 44.402289 | 17 |
| EV | 54173 | 30498951 | 15.796431 | 49.861469 | 17 |
| F9 | 685 | 1109700 | 21.920705 | 61.645997 | 10 |
| FL | 3260 | 2167344 | 20.115906 | 54.087671 | 14 |
| HA | 342 | 1704186 | -6.915205 | 75.129420 | 1 |
| MQ | 26397 | 15033955 | 10.774733 | 43.174306 | 15 |
| OO | 32 | 16026 | 11.931034 | 48.584926 | 7 |
| UA | 58665 | 89705524 | 3.558011 | 40.984344 | 18 |
| US | 20536 | 11365778 | 2.129595 | 33.066952 | 20 |
| VX | 5162 | 12902327 | 1.764464 | 49.966450 | 13 |
| WN | 12275 | 12229203 | 9.649120 | 46.877702 | 15 |
| YV | 601 | 225395 | 15.556985 | 52.922234 | 14 |
# You can also create a function to include multiple aggregate functions on different columns.
# In this way, you can give a name for each new column in the resulting dataframe.
def f(x):
d = {}
d['flight_count'] = x['flight'].count()
d['total_distance'] = x['distance'].sum()
d['arr_delay_mean'] = x['arr_delay'].mean()
d['arr_delay_std'] = x['arr_delay'].std()
d['hour_range'] = x['hour'].max() - x['hour'].min()
return pd.Series(d)
flights.groupby('carrier').apply(f)
| flight_count | total_distance | arr_delay_mean | arr_delay_std | hour_range | |
|---|---|---|---|---|---|
| carrier | |||||
| 9E | 18460.0 | 9788152.0 | 7.379669 | 50.086778 | 16.0 |
| AA | 32729.0 | 43864584.0 | 0.364291 | 42.516182 | 16.0 |
| AS | 714.0 | 1715028.0 | -9.930889 | 36.482633 | 11.0 |
| B6 | 54635.0 | 58384137.0 | 9.457973 | 42.842297 | 18.0 |
| DL | 48110.0 | 59507317.0 | 1.644341 | 44.402289 | 17.0 |
| EV | 54173.0 | 30498951.0 | 15.796431 | 49.861469 | 17.0 |
| F9 | 685.0 | 1109700.0 | 21.920705 | 61.645997 | 10.0 |
| FL | 3260.0 | 2167344.0 | 20.115906 | 54.087671 | 14.0 |
| HA | 342.0 | 1704186.0 | -6.915205 | 75.129420 | 1.0 |
| MQ | 26397.0 | 15033955.0 | 10.774733 | 43.174306 | 15.0 |
| OO | 32.0 | 16026.0 | 11.931034 | 48.584926 | 7.0 |
| UA | 58665.0 | 89705524.0 | 3.558011 | 40.984344 | 18.0 |
| US | 20536.0 | 11365778.0 | 2.129595 | 33.066952 | 20.0 |
| VX | 5162.0 | 12902327.0 | 1.764464 | 49.966450 | 13.0 |
| WN | 12275.0 | 12229203.0 | 9.649120 | 46.877702 | 15.0 |
| YV | 601.0 | 225395.0 | 15.556985 | 52.922234 | 14.0 |
Now let's put multiple operators we've learned together. Imagine that we want to explore the relationship between the distance and average delay for each destination.
There are four steps to prepare this data:
dest_df = flights.groupby('dest').agg({'flight': 'size',
'distance': 'mean',
'arr_delay': 'mean'}).reset_index()
dest_df
| dest | flight | distance | arr_delay | |
|---|---|---|---|---|
| 0 | ABQ | 254 | 1826.000000 | 4.381890 |
| 1 | ACK | 265 | 199.000000 | 4.852273 |
| 2 | ALB | 439 | 143.000000 | 14.397129 |
| 3 | ANC | 8 | 3370.000000 | -2.500000 |
| 4 | ATL | 17215 | 757.108220 | 11.300113 |
| ... | ... | ... | ... | ... |
| 100 | TPA | 7466 | 1003.935575 | 7.408525 |
| 101 | TUL | 315 | 1215.000000 | 33.659864 |
| 102 | TVC | 101 | 652.386139 | 12.968421 |
| 103 | TYS | 631 | 638.809826 | 24.069204 |
| 104 | XNA | 1036 | 1142.505792 | 7.465726 |
105 rows × 4 columns
dest_df = flights.groupby('dest').agg({'flight': 'size',
'distance': 'mean',
'arr_delay': 'mean'}).reset_index()
#dest_df
dest_no_hnl = dest_df[dest_df.dest!='HNL']
#dest_no_hnl
dest_sorted = dest_no_hnl.sort_values('arr_delay', ascending=False)
dest_sorted
| dest | flight | distance | arr_delay | |
|---|---|---|---|---|
| 18 | CAE | 116 | 603.551724 | 41.764151 |
| 101 | TUL | 315 | 1215.000000 | 33.659864 |
| 67 | OKC | 346 | 1325.000000 | 30.619048 |
| 46 | JAC | 25 | 1875.600000 | 28.095238 |
| 103 | TYS | 631 | 638.809826 | 24.069204 |
| ... | ... | ... | ... | ... |
| 98 | STT | 522 | 1626.982759 | -3.835907 |
| 95 | SNA | 825 | 2434.000000 | -7.868227 |
| 77 | PSP | 19 | 2378.000000 | -12.722222 |
| 50 | LEX | 1 | 604.000000 | -22.000000 |
| 51 | LGA | 1 | 17.000000 | NaN |
104 rows × 4 columns
# Put everything in one line of code:
(
flights
.groupby('dest')
.agg(
{'flight': 'size',
'distance': 'mean',
'arr_delay': 'mean'}
).reset_index()
.query("dest!='HNL'")
.sort_values('arr_delay', ascending=False)
)
| dest | flight | distance | arr_delay | |
|---|---|---|---|---|
| 18 | CAE | 116 | 603.551724 | 41.764151 |
| 101 | TUL | 315 | 1215.000000 | 33.659864 |
| 67 | OKC | 346 | 1325.000000 | 30.619048 |
| 46 | JAC | 25 | 1875.600000 | 28.095238 |
| 103 | TYS | 631 | 638.809826 | 24.069204 |
| ... | ... | ... | ... | ... |
| 98 | STT | 522 | 1626.982759 | -3.835907 |
| 95 | SNA | 825 | 2434.000000 | -7.868227 |
| 77 | PSP | 19 | 2378.000000 | -12.722222 |
| 50 | LEX | 1 | 604.000000 | -22.000000 |
| 51 | LGA | 1 | 17.000000 | NaN |
104 rows × 4 columns
# Assumption: flights that are not cancelled should have values for dep_delay and arr_delay.
not_cancelled = flights[flights.dep_delay.notnull() & flights.arr_delay.notnull()]
not_cancelled
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336765 | 2013 | 9 | 30 | 2240.0 | 2245 | -5.0 | 2334.0 | 2351 | -17.0 | B6 | 1816 | N354JB | JFK | SYR | 41.0 | 209 | 22 | 45 | 2013-10-01T02:00:00Z |
| 336766 | 2013 | 9 | 30 | 2240.0 | 2250 | -10.0 | 2347.0 | 7 | -20.0 | B6 | 2002 | N281JB | JFK | BUF | 52.0 | 301 | 22 | 50 | 2013-10-01T02:00:00Z |
| 336767 | 2013 | 9 | 30 | 2241.0 | 2246 | -5.0 | 2345.0 | 1 | -16.0 | B6 | 486 | N346JB | JFK | ROC | 47.0 | 264 | 22 | 46 | 2013-10-01T02:00:00Z |
| 336768 | 2013 | 9 | 30 | 2307.0 | 2255 | 12.0 | 2359.0 | 2358 | 1.0 | B6 | 718 | N565JB | JFK | BOS | 33.0 | 187 | 22 | 55 | 2013-10-01T02:00:00Z |
| 336769 | 2013 | 9 | 30 | 2349.0 | 2359 | -10.0 | 325.0 | 350 | -25.0 | B6 | 745 | N516JB | JFK | PSE | 196.0 | 1617 | 23 | 59 | 2013-10-01T03:00:00Z |
327346 rows × 19 columns
delays = not_cancelled.groupby('tailnum').agg({'arr_delay':'mean', 'flight':'count'})
delays
| arr_delay | flight | |
|---|---|---|
| tailnum | ||
| D942DN | 31.500000 | 4 |
| N0EGMQ | 9.982955 | 352 |
| N10156 | 12.717241 | 145 |
| N102UW | 2.937500 | 48 |
| N103US | -6.934783 | 46 |
| ... | ... | ... |
| N997DL | 4.903226 | 62 |
| N998AT | 29.960000 | 25 |
| N998DL | 16.394737 | 76 |
| N999DN | 14.311475 | 61 |
| N9EAMQ | 9.235294 | 238 |
4037 rows × 2 columns
ax = delays['arr_delay'].plot.hist(bins=50)
# Wow, there are some planes that have an average delay of 5 hours (300 minutes)!
ax = delays.plot.scatter(x='arr_delay', y='flight')
Not surprisingly, there is much greater variation in the average delay when there are few flights. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you’ll see that the variation decreases as the sample size increases.
When looking at this sort of plot, it’s often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups.
ax = delays.query('flight>25').plot.scatter(x='arr_delay', y='flight')
# How many flights left before 5am on each day?
# (these usually indicate delayed flights from the previous day)
(
not_cancelled
.query('dep_time<500')
.groupby(['year','month','day'])
.size()
)
year month day
2013 1 2 3
3 4
4 3
5 3
6 2
..
12 27 7
28 2
29 3
30 6
31 4
Length: 348, dtype: int64
#which day has the most flights before 5 am?
(
not_cancelled
.query('dep_time<500')
.groupby(['year','month','day'])
.agg({'flight':'size'})
.reset_index()
.sort_values('flight', ascending=False)
.head(1)
)
| year | month | day | flight | |
|---|---|---|---|---|
| 172 | 2013 | 6 | 28 | 32 |
Grouping is most useful in conjunction with aggregate functions. But you can also do other operations within groups:
# Find the worst members of each group:
# Find the top three flights with the longest arr_delay everyday.
flights['rank_daily_delay'] = flights.groupby(['year', 'month', 'day'])['arr_delay'].rank(method='min',ascending=False)
flights[flights.rank_daily_delay<=3]
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | ... | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | rank_daily_delay | prop_delay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 151 | 2013 | 1 | 1 | 848.0 | 1835 | 853.0 | 1001.0 | 1950 | 851.0 | MQ | ... | N942MQ | JFK | BWI | 41.0 | 184 | 18 | 35 | 2013-01-01T23:00:00Z | 1.0 | 0.025090 |
| 649 | 2013 | 1 | 1 | 1815.0 | 1325 | 290.0 | 2120.0 | 1542 | 338.0 | EV | ... | N17185 | EWR | OMA | 213.0 | 1134 | 13 | 25 | 2013-01-01T18:00:00Z | 3.0 | 0.017560 |
| 834 | 2013 | 1 | 1 | 2343.0 | 1724 | 379.0 | 314.0 | 1938 | 456.0 | EV | ... | N21197 | EWR | MCI | 222.0 | 1092 | 17 | 24 | 2013-01-01T22:00:00Z | 2.0 | 0.010116 |
| 1310 | 2013 | 1 | 2 | 1412.0 | 838 | 334.0 | 1710.0 | 1147 | 323.0 | UA | ... | N474UA | EWR | MCO | 150.0 | 937 | 8 | 38 | 2013-01-02T13:00:00Z | 3.0 | 0.001567 |
| 1440 | 2013 | 1 | 2 | 1607.0 | 1030 | 337.0 | 2003.0 | 1355 | 368.0 | AA | ... | N324AA | JFK | SFO | 346.0 | 2586 | 10 | 30 | 2013-01-02T15:00:00Z | 1.0 | 0.001792 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335540 | 2013 | 9 | 29 | 1745.0 | 1330 | 255.0 | 2028.0 | 1632 | 236.0 | B6 | ... | N517JB | LGA | SRQ | 137.0 | 1047 | 13 | 30 | 2013-09-29T17:00:00Z | 1.0 | 0.015258 |
| 335778 | 2013 | 9 | 29 | 2327.0 | 1942 | 225.0 | 153.0 | 2250 | 183.0 | B6 | ... | N659JB | LGA | FLL | 129.0 | 1076 | 19 | 42 | 2013-09-29T23:00:00Z | 3.0 | 0.000903 |
| 336252 | 2013 | 9 | 30 | 1324.0 | 830 | 294.0 | 1512.0 | 1040 | 272.0 | EV | ... | N761ND | LGA | CLT | 79.0 | 544 | 8 | 30 | 2013-09-30T12:00:00Z | 1.0 | 0.001311 |
| 336668 | 2013 | 9 | 30 | 1951.0 | 1649 | 182.0 | 2157.0 | 1903 | 174.0 | EV | ... | N13988 | EWR | SAV | 95.0 | 708 | 16 | 49 | 2013-09-30T20:00:00Z | 3.0 | 0.010440 |
| 336757 | 2013 | 9 | 30 | 2159.0 | 1845 | 194.0 | 2344.0 | 2030 | 194.0 | 9E | ... | N906XJ | JFK | BUF | 50.0 | 301 | 18 | 45 | 2013-09-30T22:00:00Z | 2.0 | 0.002537 |
1108 rows × 21 columns
# Find all groups bigger than a threshold:
# Find all flights that fly to the popular destinations that appear in over 1000 times(flights).
flights.groupby('dest').filter(lambda x: x['dest'].count()>1000)
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | rank_daily_delay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z | 288.0 |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z | 198.0 |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z | 122.0 |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z | 791.0 |
| 5 | 2013 | 1 | 1 | 554.0 | 558 | -4.0 | 740.0 | 728 | 12.0 | UA | 1696 | N39463 | EWR | ORD | 150.0 | 719 | 5 | 58 | 2013-01-01T10:00:00Z | 271.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | JFK | DCA | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z | NaN |
| 336772 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | NaN | LGA | SYR | NaN | 198 | 22 | 0 | 2013-10-01T02:00:00Z | NaN |
| 336773 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | LGA | BNA | NaN | 764 | 12 | 10 | 2013-09-30T16:00:00Z | NaN |
| 336774 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | LGA | CLE | NaN | 419 | 11 | 59 | 2013-09-30T15:00:00Z | NaN |
| 336775 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | LGA | RDU | NaN | 431 | 8 | 40 | 2013-09-30T12:00:00Z | NaN |
320366 rows × 20 columns
# Standardise to compute per group metrics:
# For all flights that arrived later than scheduled,
# calculate the proportion of arrival delay among delayed flights to the each destination
# display year, month, day, destination, flight, arr_delay, and proportion of arr_delay
flights['prop_delay'] = flights[flights.arr_delay>0].groupby('dest')['arr_delay'].transform(lambda x: x / x.sum())
flights[flights.arr_delay>0][['year', 'month', 'day', 'dest', 'flight', 'arr_delay', 'prop_delay']]
| year | month | day | dest | flight | arr_delay | prop_delay | |
|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | IAH | 1545 | 11.0 | 0.000111 |
| 1 | 2013 | 1 | 1 | IAH | 1714 | 20.0 | 0.000201 |
| 2 | 2013 | 1 | 1 | MIA | 1141 | 33.0 | 0.000235 |
| 5 | 2013 | 1 | 1 | ORD | 1696 | 12.0 | 0.000042 |
| 6 | 2013 | 1 | 1 | FLL | 507 | 19.0 | 0.000094 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 336759 | 2013 | 9 | 30 | BNA | 3660 | 7.0 | 0.000057 |
| 336760 | 2013 | 9 | 30 | STL | 4672 | 57.0 | 0.000717 |
| 336762 | 2013 | 9 | 30 | SFO | 471 | 42.0 | 0.000204 |
| 336763 | 2013 | 9 | 30 | MCO | 1083 | 130.0 | 0.000631 |
| 336768 | 2013 | 9 | 30 | BOS | 718 | 1.0 | 0.000005 |
133004 rows × 7 columns
import pandas as pd
import numpy as np
from nycflights13 import flights
flights.head()
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
Date of departure
Actual departure and arrival times (format HHMM or HMM), local tz.
Scheduled departure and arrival times (format HHMM or HMM), local tz.
Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
Time of scheduled departure broken into hour and minutes.
Two letter carrier abbreviation. See airlines() to get name
Plane tail number
Flight number
Origin and destination. See airports() for additional metadata.
Amount of time spent in the air, in minutes
Distance between airports, in miles
Scheduled date and hour of the flight as a date. Along with origin, can be used to join flights data to weather data.
flights.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 336776 entries, 0 to 336775 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 year 336776 non-null int64 1 month 336776 non-null int64 2 day 336776 non-null int64 3 dep_time 328521 non-null float64 4 sched_dep_time 336776 non-null int64 5 dep_delay 328521 non-null float64 6 arr_time 328063 non-null float64 7 sched_arr_time 336776 non-null int64 8 arr_delay 327346 non-null float64 9 carrier 336776 non-null object 10 flight 336776 non-null int64 11 tailnum 334264 non-null object 12 origin 336776 non-null object 13 dest 336776 non-null object 14 air_time 327346 non-null float64 15 distance 336776 non-null int64 16 hour 336776 non-null int64 17 minute 336776 non-null int64 18 time_hour 336776 non-null object dtypes: float64(5), int64(9), object(5) memory usage: 48.8+ MB
Write python script to answer the following questions.
# Find all flights (identfied by 'carrier' and 'flight')
# that always depart at least 60 minutes late.
flights.query('dep_delay>=60').groupby(['carrier','flight']).size()
#( flights.groupby(['carrier','flight']).agg({'dep_delay':'min'}).query('dep_delay>=60').reset_index() )
carrier flight
9E 2900 4
2901 4
2902 2
2903 7
2904 7
..
YV 3775 1
3788 5
3790 1
3791 4
3799 1
Length: 3462, dtype: int64
# How many flights always depart at least 60 minutes late?
flights.query('dep_delay>=60').count()
#( flights.groupby(['carrier','flight']).agg({'dep_delay':'min'}).query('dep_delay>=60').reset_index() ).shape[0]
year 27059 month 27059 day 27059 dep_time 27059 sched_dep_time 27059 dep_delay 27059 arr_time 26902 sched_arr_time 27059 arr_delay 26802 carrier 27059 flight 27059 tailnum 27059 origin 27059 dest 27059 air_time 26802 distance 27059 hour 27059 minute 27059 time_hour 27059 calc 26802 dtype: int64
# Finda all United Airlines (UA) flights (identfied by 'carrier' and 'flight')
# that depart (at least) 60 minutes late.
(
flights
.query('dep_delay>=60' and "carrier=='UA'")
.groupby(['carrier','flight'])
.size()
)
#( flights[flights.carrier=='UA'].groupby(['carrier','flight']).agg({'dep_delay':'min'}).query('dep_delay>=60').reset_index() )
carrier flight
UA 1 3
3 6
10 18
12 8
15 365
...
1740 79
1741 6
1742 11
1743 3
1744 131
Length: 1285, dtype: int64
# How many United Airlines (UA) flights depart (at least) 60 minutes late?
(
flights
.query('dep_delay>=60' and "carrier=='UA'")
.count()
)
year 58665 month 58665 day 58665 dep_time 57979 sched_dep_time 58665 dep_delay 57979 arr_time 57916 sched_arr_time 58665 arr_delay 57782 carrier 58665 flight 58665 tailnum 57979 origin 58665 dest 58665 air_time 57782 distance 58665 hour 58665 minute 58665 time_hour 58665 calc 57782 dtype: int64
# Flights that always arrived on time in December?
flights.query('arr_delay==0' and "month==12")
#( flights[flights.month==12].groupby(['carrier','flight']).agg({'arr_delay':'max'}).query('arr_delay<=0').reset_index() )
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | calc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 83161 | 2013 | 12 | 1 | 13.0 | 2359 | 14.0 | 446.0 | 445 | 1.0 | B6 | 745 | N715JB | JFK | PSE | 195.0 | 1617 | 23 | 59 | 2013-12-02T04:00:00Z | 1.089385 |
| 83162 | 2013 | 12 | 1 | 17.0 | 2359 | 18.0 | 443.0 | 437 | 6.0 | B6 | 839 | N593JB | JFK | BQN | 186.0 | 1576 | 23 | 59 | 2013-12-02T04:00:00Z | 1.075145 |
| 83163 | 2013 | 12 | 1 | 453.0 | 500 | -7.0 | 636.0 | 651 | -15.0 | US | 1895 | N197UW | EWR | CLT | 86.0 | 529 | 5 | 0 | 2013-12-01T10:00:00Z | 1.343750 |
| 83164 | 2013 | 12 | 1 | 520.0 | 515 | 5.0 | 749.0 | 808 | -19.0 | UA | 1487 | N69804 | EWR | IAH | 193.0 | 1400 | 5 | 15 | 2013-12-01T10:00:00Z | 1.198758 |
| 83165 | 2013 | 12 | 1 | 536.0 | 540 | -4.0 | 845.0 | 850 | -5.0 | AA | 2243 | N634AA | JFK | MIA | 144.0 | 1089 | 5 | 40 | 2013-12-01T10:00:00Z | 1.152000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 111291 | 2013 | 12 | 31 | NaN | 705 | NaN | NaN | 931 | NaN | UA | 1729 | NaN | EWR | DEN | NaN | 1605 | 7 | 5 | 2013-12-31T12:00:00Z | NaN |
| 111292 | 2013 | 12 | 31 | NaN | 825 | NaN | NaN | 1029 | NaN | US | 1831 | NaN | JFK | CLT | NaN | 541 | 8 | 25 | 2013-12-31T13:00:00Z | NaN |
| 111293 | 2013 | 12 | 31 | NaN | 1615 | NaN | NaN | 1800 | NaN | MQ | 3301 | N844MQ | LGA | RDU | NaN | 431 | 16 | 15 | 2013-12-31T21:00:00Z | NaN |
| 111294 | 2013 | 12 | 31 | NaN | 600 | NaN | NaN | 735 | NaN | UA | 219 | NaN | EWR | ORD | NaN | 719 | 6 | 0 | 2013-12-31T11:00:00Z | NaN |
| 111295 | 2013 | 12 | 31 | NaN | 830 | NaN | NaN | 1154 | NaN | UA | 443 | NaN | JFK | LAX | NaN | 2475 | 8 | 30 | 2013-12-31T13:00:00Z | NaN |
28135 rows × 20 columns
# How many flights always arrived on time in December?
flights.query('arr_delay==0' and "month==12").count()
#( flights.groupby(['carrier','flight']).agg({'dep_delay':'min'}).query('dep_delay>=60').reset_index() ).shape[0]
year 28135 month 28135 day 28135 dep_time 27110 sched_dep_time 28135 dep_delay 27110 arr_time 27076 sched_arr_time 28135 arr_delay 27020 carrier 28135 flight 28135 tailnum 27865 origin 28135 dest 28135 air_time 27020 distance 28135 hour 28135 minute 28135 time_hour 28135 calc 27020 dtype: int64
# Create a dataframe 'cancelled' for all cancelled flights.
# Assumption: flights that are not cancelled should have values for dep_time.
cancelled=flights[flights.dep_delay.isnull() & flights.arr_delay.isnull()]
cancelled
#cancelled.shape
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | calc | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 838 | 2013 | 1 | 1 | NaN | 1630 | NaN | NaN | 1815 | NaN | EV | 4308 | N18120 | EWR | RDU | NaN | 416 | 16 | 30 | 2013-01-01T21:00:00Z | NaN |
| 839 | 2013 | 1 | 1 | NaN | 1935 | NaN | NaN | 2240 | NaN | AA | 791 | N3EHAA | LGA | DFW | NaN | 1389 | 19 | 35 | 2013-01-02T00:00:00Z | NaN |
| 840 | 2013 | 1 | 1 | NaN | 1500 | NaN | NaN | 1825 | NaN | AA | 1925 | N3EVAA | LGA | MIA | NaN | 1096 | 15 | 0 | 2013-01-01T20:00:00Z | NaN |
| 841 | 2013 | 1 | 1 | NaN | 600 | NaN | NaN | 901 | NaN | B6 | 125 | N618JB | JFK | FLL | NaN | 1069 | 6 | 0 | 2013-01-01T11:00:00Z | NaN |
| 1777 | 2013 | 1 | 2 | NaN | 1540 | NaN | NaN | 1747 | NaN | EV | 4352 | N10575 | EWR | CVG | NaN | 569 | 15 | 40 | 2013-01-02T20:00:00Z | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | JFK | DCA | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z | NaN |
| 336772 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | NaN | LGA | SYR | NaN | 198 | 22 | 0 | 2013-10-01T02:00:00Z | NaN |
| 336773 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | LGA | BNA | NaN | 764 | 12 | 10 | 2013-09-30T16:00:00Z | NaN |
| 336774 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | LGA | CLE | NaN | 419 | 11 | 59 | 2013-09-30T15:00:00Z | NaN |
| 336775 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | LGA | RDU | NaN | 431 | 8 | 40 | 2013-09-30T12:00:00Z | NaN |
8255 rows × 20 columns
# Look at the number of cancelled flights per day.
# Find the top 10 days with the most cancelled flights.
(
cancelled
.groupby(['year','month','day'])
.agg({'flight':'count'})
.reset_index()
.sort_values('flight', ascending=False)
.head(10)
)
| year | month | day | flight | |
|---|---|---|---|---|
| 38 | 2013 | 2 | 8 | 472 |
| 39 | 2013 | 2 | 9 | 393 |
| 140 | 2013 | 5 | 23 | 221 |
| 336 | 2013 | 12 | 10 | 204 |
| 251 | 2013 | 9 | 12 | 192 |
| 64 | 2013 | 3 | 6 | 180 |
| 66 | 2013 | 3 | 8 | 180 |
| 331 | 2013 | 12 | 5 | 158 |
| 340 | 2013 | 12 | 14 | 125 |
| 199 | 2013 | 7 | 22 | 123 |
# Which carrier has the worst delays?
(
flights
.groupby('carrier')
.agg({'arr_delay':'mean','dep_delay': 'mean'}) # mean/max
.sort_values('arr_delay' and 'dep_delay',ascending=False)
.reset_index()
)
| carrier | arr_delay | dep_delay | |
|---|---|---|---|
| 0 | F9 | 21.920705 | 20.215543 |
| 1 | EV | 15.796431 | 19.955390 |
| 2 | YV | 15.556985 | 18.996330 |
| 3 | FL | 20.115906 | 18.726075 |
| 4 | WN | 9.649120 | 17.711744 |
| 5 | 9E | 7.379669 | 16.725769 |
| 6 | B6 | 9.457973 | 13.022522 |
| 7 | VX | 1.764464 | 12.869421 |
| 8 | OO | 11.931034 | 12.586207 |
| 9 | UA | 3.558011 | 12.106073 |
| 10 | MQ | 10.774733 | 10.552041 |
| 11 | DL | 1.644341 | 9.264505 |
| 12 | AA | 0.364291 | 8.586016 |
| 13 | AS | -9.930889 | 5.804775 |
| 14 | HA | -6.915205 | 4.900585 |
| 15 | US | 2.129595 | 3.782418 |
# Which plane (tailnum) has the worst on-time record?
(
flights
.groupby('tailnum') #.dep_delay # .agg(['min','mean','max'])
.agg({'arr_delay':'mean','dep_delay': 'mean', 'flight':'count'})
.sort_values('arr_delay' and 'dep_delay',ascending=False)
.reset_index()
)
| tailnum | arr_delay | dep_delay | flight | |
|---|---|---|---|---|
| 0 | N844MH | 320.0 | 297.0 | 1 |
| 1 | N922EV | 276.0 | 274.0 | 1 |
| 2 | N587NW | 264.0 | 272.0 | 1 |
| 3 | N911DA | 294.0 | 268.0 | 1 |
| 4 | N851NW | 219.0 | 233.0 | 1 |
| ... | ... | ... | ... | ... |
| 4038 | N728SK | NaN | NaN | 1 |
| 4039 | N768SK | NaN | NaN | 1 |
| 4040 | N862DA | NaN | NaN | 1 |
| 4041 | N865DA | NaN | NaN | 1 |
| 4042 | N939DN | NaN | NaN | 1 |
4043 rows × 4 columns
# What time (hour) of day should you fly if you want to avoid delays as much as possible?
(
flights
.groupby('hour') #.dep_delay #.agg(['min','mean','max'])
.agg({'arr_delay':'mean','dep_delay': 'mean', 'flight':'count'})
.sort_values('flight')
.reset_index()
)
| hour | arr_delay | dep_delay | flight | |
|---|---|---|---|---|
| 0 | 1 | NaN | NaN | 1 |
| 1 | 23 | 11.755278 | 14.017176 | 1061 |
| 2 | 5 | -4.796907 | 0.687757 | 1953 |
| 3 | 22 | 15.967162 | 18.791097 | 2639 |
| 4 | 21 | 18.386937 | 24.195743 | 10933 |
| 5 | 11 | 1.481930 | 7.191650 | 16033 |
| 6 | 10 | 0.953940 | 6.498295 | 16708 |
| 7 | 20 | 16.676110 | 24.304105 | 16739 |
| 8 | 12 | 3.489010 | 8.614849 | 18181 |
| 9 | 13 | 6.544740 | 11.437650 | 19956 |
| 10 | 9 | -1.451407 | 4.583738 | 20312 |
| 11 | 19 | 16.655874 | 24.784791 | 21441 |
| 12 | 14 | 9.197650 | 13.818874 | 21706 |
| 13 | 18 | 14.788724 | 21.110082 | 21783 |
| 14 | 7 | -5.304472 | 1.914078 | 22821 |
| 15 | 16 | 12.597641 | 18.757017 | 23002 |
| 16 | 15 | 12.324192 | 16.894565 | 23888 |
| 17 | 17 | 16.040267 | 21.100606 | 24426 |
| 18 | 6 | -3.384485 | 1.642796 | 25951 |
| 19 | 8 | -1.113227 | 4.127948 | 27242 |
# For each destination, compute the total minutes of arrival delay.
(
flights
.groupby('dest')
.agg({'arr_delay':'sum'})
.sort_values('arr_delay')
.reset_index()
)
| dest | arr_delay | |
|---|---|---|
| 0 | SNA | -6389.0 |
| 1 | SEA | -4270.0 |
| 2 | STT | -1987.0 |
| 3 | HNL | -957.0 |
| 4 | PSP | -229.0 |
| ... | ... | ... |
| 100 | DCA | 82609.0 |
| 101 | FLL | 96153.0 |
| 102 | ORD | 97352.0 |
| 103 | CLT | 100645.0 |
| 104 | ATL | 190260.0 |
105 rows × 2 columns
# Optional #######
# For each flight ('carrier' + 'flight'), compute the proportion of the total delay for its destination.
delayed = (
flights
[flights.arr_delay>0]
.groupby(['dest','carrier','flight'])
.agg({'arr_delay':'sum'})
.reset_index()
)
delayed['arr_delay_prop'] = (
delayed
.groupby('dest')
.arr_delay
.transform(lambda x: x / x.sum())
)
delayed
| dest | carrier | flight | arr_delay | arr_delay_prop | |
|---|---|---|---|---|---|
| 0 | ABQ | B6 | 65 | 1943.0 | 0.433029 |
| 1 | ABQ | B6 | 1505 | 2544.0 | 0.566971 |
| 2 | ACK | B6 | 1191 | 1413.0 | 0.475118 |
| 3 | ACK | B6 | 1195 | 62.0 | 0.020847 |
| 4 | ACK | B6 | 1291 | 267.0 | 0.089778 |
| ... | ... | ... | ... | ... | ... |
| 8589 | XNA | MQ | 3553 | 415.0 | 0.026693 |
| 8590 | XNA | MQ | 3713 | 895.0 | 0.057567 |
| 8591 | XNA | MQ | 4413 | 1763.0 | 0.113398 |
| 8592 | XNA | MQ | 4525 | 2024.0 | 0.130186 |
| 8593 | XNA | MQ | 4534 | 1000.0 | 0.064321 |
8594 rows × 5 columns
Grouping is most useful in conjunction with aggregate functions. But you can also do other operations within groups:
# Find the worst days in terms of average arrival delay greater than 1 hour:
(
flights
.query('arr_delay>60')
.groupby('day') #['year','month','day']
.agg({'arr_delay':'mean'})
.sort_values('arr_delay',ascending=False)
.reset_index()
)
| day | arr_delay | |
|---|---|---|
| 0 | 5 | 134.574349 |
| 1 | 10 | 132.541020 |
| 2 | 28 | 131.955224 |
| 3 | 8 | 129.912909 |
| 4 | 27 | 128.016949 |
| 5 | 2 | 127.688913 |
| 6 | 18 | 126.857143 |
| 7 | 24 | 125.019011 |
| 8 | 23 | 123.797391 |
| 9 | 12 | 123.531915 |
| 10 | 22 | 122.893398 |
| 11 | 19 | 121.938815 |
| 12 | 7 | 121.843384 |
| 13 | 30 | 120.381223 |
| 14 | 25 | 119.922031 |
| 15 | 17 | 119.057930 |
| 16 | 1 | 118.751210 |
| 17 | 13 | 117.461170 |
| 18 | 11 | 116.133452 |
| 19 | 9 | 115.361842 |
| 20 | 16 | 115.114566 |
| 21 | 14 | 115.089153 |
| 22 | 3 | 115.065772 |
| 23 | 31 | 113.563131 |
| 24 | 21 | 111.875648 |
| 25 | 20 | 111.383275 |
| 26 | 6 | 111.055319 |
| 27 | 29 | 110.785300 |
| 28 | 26 | 108.919034 |
| 29 | 15 | 108.122407 |
| 30 | 4 | 106.293850 |
# Look at each destination. Can you find flights that are suspiciously slow?
# (i.e. flights that represent a potential data entry error).
# Compute the air time of a flight relative to the shortest flight to that destination.
flights['calc']=flights[flights.air_time>0].groupby('dest')['air_time'].transform(lambda x: x / x.min())
flights[flights.air_time>0][['dest','calc','air_time']]
# flights['air_time_ratio'] = (
#
# flights
# .groupby(['origin','dest'])
# .air_time
# .transform(lambda x: x / x.mean())
#)
# flights.sort_values('air_time_ratio',ascending=False).head(10)
| dest | calc | air_time | |
|---|---|---|---|
| 0 | IAH | 1.409938 | 227.0 |
| 1 | IAH | 1.409938 | 227.0 |
| 2 | MIA | 1.280000 | 160.0 |
| 3 | BQN | 1.057803 | 183.0 |
| 4 | ATL | 1.784615 | 116.0 |
| ... | ... | ... | ... |
| 336765 | SYR | 1.366667 | 41.0 |
| 336766 | BUF | 1.368421 | 52.0 |
| 336767 | ROC | 1.342857 | 47.0 |
| 336768 | BOS | 1.571429 | 33.0 |
| 336769 | PSE | 1.094972 | 196.0 |
327346 rows × 3 columns
# Find the number of carriers between each pair of origin and destination.
#df1=(
(
flights
.groupby(['origin','dest']) #['carrier']
.agg({'carrier':'count'})
.reset_index()
)
# (df1.groupby(['origin','dest']).agg({'carrier':'count'}).reset_index())
| origin | dest | carrier | |
|---|---|---|---|
| 0 | EWR | ALB | 439 |
| 1 | EWR | ANC | 8 |
| 2 | EWR | ATL | 5022 |
| 3 | EWR | AUS | 968 |
| 4 | EWR | AVL | 265 |
| ... | ... | ... | ... |
| 219 | LGA | SYR | 293 |
| 220 | LGA | TPA | 2145 |
| 221 | LGA | TVC | 77 |
| 222 | LGA | TYS | 308 |
| 223 | LGA | XNA | 745 |
224 rows × 3 columns
# Optional #######
# For each destination, rank the carriers by the average departure delay.
dest_carriers = (
flights
.groupby(['dest','carrier'])
.agg({'dep_delay':'mean'})
.reset_index()
)
# rank carriers within dest group
dest_carriers['rank_by_delay']= (
dest_carriers
.groupby(['dest'])
['dep_delay']
.rank(method='min')
)
dest_carriers.sort_values(['dest','rank_by_delay'])
| dest | carrier | dep_delay | rank_by_delay | |
|---|---|---|---|---|
| 0 | ABQ | B6 | 13.740157 | 1.0 |
| 1 | ACK | B6 | 6.456604 | 1.0 |
| 2 | ALB | EV | 23.620525 | 1.0 |
| 3 | ANC | UA | 12.875000 | 1.0 |
| 4 | ATL | 9E | 0.964912 | 1.0 |
| ... | ... | ... | ... | ... |
| 308 | TVC | EV | 27.476190 | 2.0 |
| 310 | TYS | 9E | 12.705660 | 1.0 |
| 311 | TYS | EV | 41.818471 | 2.0 |
| 313 | XNA | MQ | 5.843923 | 1.0 |
| 312 | XNA | EV | 8.031359 | 2.0 |
314 rows × 4 columns
# Optional #######
# For each plane (tailnum), count the number of
# flights that arrived late for over than 1 hour.
flights['delay_1hr'] = (flights.arr_delay>60)
(
flights
[flights.arr_delay>60]
.groupby('tailnum')
.agg({'delay_1hr':'count'})
.reset_index()
)
| tailnum | delay_1hr | |
|---|---|---|
| 0 | D942DN | 1 |
| 1 | N0EGMQ | 30 |
| 2 | N10156 | 15 |
| 3 | N102UW | 2 |
| 4 | N104UW | 4 |
| ... | ... | ... |
| 3366 | N997DL | 5 |
| 3367 | N998AT | 3 |
| 3368 | N998DL | 8 |
| 3369 | N999DN | 6 |
| 3370 | N9EAMQ | 21 |
3371 rows × 2 columns
flights['delay_1hr'] = (flights.arr_delay>60)
flights.head()
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | delay_1hr | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z | False |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z | False |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z | False |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z | False |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z | False |
In the next few assignments, you will be working with this data set of IMDB top 1000 movies.
Source: https://www.kaggle.com/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows
import pandas as pd
import numpy as np
# Read the data file "imdb_top_1000.csv" to a dataframe named "imdb"
imdb = pd.read_csv('../data/imdb_top_1000.csv', header=0)
imdb.head()
| Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28,341,469 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134,966,411 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534,858,444 |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57,300,000 |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4,360,000 |
# Describe the dataframe using the info() method.
imdb.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1000 entries, 0 to 999 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Poster_Link 1000 non-null object 1 Series_Title 1000 non-null object 2 Released_Year 1000 non-null object 3 Certificate 899 non-null object 4 Runtime 1000 non-null object 5 Genre 1000 non-null object 6 IMDB_Rating 1000 non-null float64 7 Overview 1000 non-null object 8 Meta_score 843 non-null float64 9 Director 1000 non-null object 10 Star1 1000 non-null object 11 Star2 1000 non-null object 12 Star3 1000 non-null object 13 Star4 1000 non-null object 14 No_of_Votes 1000 non-null int64 15 Gross 831 non-null object dtypes: float64(2), int64(1), object(13) memory usage: 125.1+ KB
# List all the column names:
imdb.columns
Index(['Poster_Link', 'Series_Title', 'Released_Year', 'Certificate',
'Runtime', 'Genre', 'IMDB_Rating', 'Overview', 'Meta_score', 'Director',
'Star1', 'Star2', 'Star3', 'Star4', 'No_of_Votes', 'Gross'],
dtype='object')
# Display top 10 movies's title, released year and IMDB rating.
imdb[['Series_Title','Released_Year','IMDB_Rating']].head(10)
| Series_Title | Released_Year | IMDB_Rating | |
|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | 9.3 |
| 1 | The Godfather | 1972 | 9.2 |
| 2 | The Dark Knight | 2008 | 9.0 |
| 3 | The Godfather: Part II | 1974 | 9.0 |
| 4 | 12 Angry Men | 1957 | 9.0 |
| 5 | The Lord of the Rings: The Return of the King | 2003 | 8.9 |
| 6 | Pulp Fiction | 1994 | 8.9 |
| 7 | Schindler's List | 1993 | 8.9 |
| 8 | Inception | 2010 | 8.8 |
| 9 | Fight Club | 1999 | 8.8 |
# Display moviess ranked 11-20.
# Show their title, released year and IMDB rating.
imdb.iloc[11:21,[1,2,6]]
| Series_Title | Released_Year | IMDB_Rating | |
|---|---|---|---|
| 11 | Forrest Gump | 1994 | 8.8 |
| 12 | Il buono, il brutto, il cattivo | 1966 | 8.8 |
| 13 | The Lord of the Rings: The Two Towers | 2002 | 8.7 |
| 14 | The Matrix | 1999 | 8.7 |
| 15 | Goodfellas | 1990 | 8.7 |
| 16 | Star Wars: Episode V - The Empire Strikes Back | 1980 | 8.7 |
| 17 | One Flew Over the Cuckoo's Nest | 1975 | 8.7 |
| 18 | Hamilton | 2020 | 8.6 |
| 19 | Gisaengchung | 2019 | 8.6 |
| 20 | Soorarai Pottru | 2020 | 8.6 |
# Select all movies directed by Quentin Tarantino.
# Show their title, released year, IMDB rating, and gross.
imdb.loc[imdb['Director']=='Quentin Tarantino'
,['Series_Title','Released_Year','IMDB_Rating','Gross']]
| Series_Title | Released_Year | IMDB_Rating | Gross | |
|---|---|---|---|---|
| 6 | Pulp Fiction | 1994 | 8.9 | 107,928,762 |
| 62 | Django Unchained | 2012 | 8.4 | 162,805,434 |
| 93 | Inglourious Basterds | 2009 | 8.3 | 120,540,719 |
| 103 | Reservoir Dogs | 1992 | 8.3 | 2,832,029 |
| 241 | Kill Bill: Vol. 1 | 2003 | 8.1 | 70,099,045 |
| 369 | Kill Bill: Vol. 2 | 2004 | 8.0 | 66,208,183 |
| 584 | The Hateful Eight | 2015 | 7.8 | 54,117,416 |
| 879 | Once Upon a Time... in Hollywood | 2019 | 7.6 | 142,502,728 |
# Select all R rated movies with IMDB_Rating>=8.5
# Show their title, released year, certificate, and IMDB rating.
imdb.loc[(imdb['Certificate']=='R')&(imdb['IMDB_Rating']>=8.5)
,['Series_Title','Released_Year','Certificate','IMDB_Rating']]
| Series_Title | Released_Year | Certificate | IMDB_Rating | |
|---|---|---|---|---|
| 24 | Saving Private Ryan | 1998 | R | 8.6 |
| 38 | The Pianist | 2002 | R | 8.5 |
| 40 | American History X | 1998 | R | 8.5 |
# How many unique values are there in the column "Released_Year"?
# Hint: nuniuqe()
imdb['Released_Year'].nunique()
100
# Count the number of movies in each "Released_Year"?
# Hint: value_counts()
imdb['Released_Year'].value_counts()
2014 32
2004 31
2009 29
2013 28
2016 28
..
1926 1
1936 1
1924 1
1921 1
PG 1
Name: Released_Year, Length: 100, dtype: int64
# In this dataset, there is a movie with an error in "Released_Year".
# Hint: Released_Year should be a 4-digit integer but this movie's is not.
# Find this movie.
imdb.sort_values('Released_Year',ascending=False)
| Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 966 | https://m.media-amazon.com/images/M/MV5BNjEzYj... | Apollo 13 | PG | U | 140 min | Adventure, Drama, History | 7.6 | NASA must devise a strategy to return Apollo 1... | 77.0 | Ron Howard | Tom Hanks | Bill Paxton | Kevin Bacon | Gary Sinise | 269197 | 173,837,933 |
| 20 | https://m.media-amazon.com/images/M/MV5BOTc2ZT... | Soorarai Pottru | 2020 | U | 153 min | Drama | 8.6 | Nedumaaran Rajangam "Maara" sets out to make t... | NaN | Sudha Kongara | Suriya | Madhavan | Paresh Rawal | Aparna Balamurali | 54995 | NaN |
| 612 | https://m.media-amazon.com/images/M/MV5BYjYzOG... | The Trial of the Chicago 7 | 2020 | R | 129 min | Drama, History, Thriller | 7.8 | The story of 7 people on trial stemming from v... | 77.0 | Aaron Sorkin | Eddie Redmayne | Alex Sharp | Sacha Baron Cohen | Jeremy Strong | 89896 | NaN |
| 613 | https://m.media-amazon.com/images/M/MV5BOTNjM2... | Druk | 2020 | NaN | 117 min | Comedy, Drama | 7.8 | Four friends, all high school teachers, test a... | 81.0 | Thomas Vinterberg | Mads Mikkelsen | Thomas Bo Larsen | Magnus Millang | Lars Ranthe | 33931 | NaN |
| 18 | https://m.media-amazon.com/images/M/MV5BNjViNW... | Hamilton | 2020 | PG-13 | 160 min | Biography, Drama, History | 8.6 | The real life of one of America's foremost fou... | 90.0 | Thomas Kail | Lin-Manuel Miranda | Phillipa Soo | Leslie Odom Jr. | Renée Elise Goldsberry | 55291 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 193 | https://m.media-amazon.com/images/M/MV5BZjEyOT... | The Gold Rush | 1925 | Passed | 95 min | Adventure, Comedy, Drama | 8.2 | A prospector goes to the Klondike in search of... | NaN | Charles Chaplin | Charles Chaplin | Mack Swain | Tom Murray | Henry Bergman | 101053 | 5,450,000 |
| 194 | https://m.media-amazon.com/images/M/MV5BZWFhOG... | Sherlock Jr. | 1924 | Passed | 45 min | Action, Comedy, Romance | 8.2 | A film projectionist longs to be a detective, ... | NaN | Buster Keaton | Buster Keaton | Kathryn McGuire | Joe Keaton | Erwin Connelly | 41985 | 977,375 |
| 568 | https://m.media-amazon.com/images/M/MV5BMTAxYj... | Nosferatu | 1922 | NaN | 94 min | Fantasy, Horror | 7.9 | Vampire Count Orlok expresses interest in a ne... | NaN | F.W. Murnau | Max Schreck | Alexander Granach | Gustav von Wangenheim | Greta Schröder | 88794 | NaN |
| 127 | https://m.media-amazon.com/images/M/MV5BZjhhMT... | The Kid | 1921 | Passed | 68 min | Comedy, Drama, Family | 8.3 | The Tramp cares for an abandoned child, but ev... | NaN | Charles Chaplin | Charles Chaplin | Edna Purviance | Jackie Coogan | Carl Miller | 113314 | 5,450,000 |
| 321 | https://m.media-amazon.com/images/M/MV5BNWJiNG... | Das Cabinet des Dr. Caligari | 1920 | NaN | 76 min | Fantasy, Horror, Mystery | 8.1 | Hypnotist Dr. Caligari uses a somnambulist, Ce... | NaN | Robert Wiene | Werner Krauss | Conrad Veidt | Friedrich Feher | Lil Dagover | 57428 | NaN |
1000 rows × 16 columns
# Correct the values for the corresponding columns ("Release_Year" and "Certificate").
# You may want to look up this movie on www.imdb.com.
# Hint: You can set value for a particular set by: df.loc[row_name, column_name] = new_value
imdb.loc[966,'Released_Year']=1995
imdb.loc[966,'Certificate']='PG'
imdb.iloc[966,:]
Poster_Link https://m.media-amazon.com/images/M/MV5BNjEzYj... Series_Title Apollo 13 Released_Year 1995 Certificate PG Runtime 140 min Genre Adventure, Drama, History IMDB_Rating 7.6 Overview NASA must devise a strategy to return Apollo 1... Meta_score 77.0 Director Ron Howard Star1 Tom Hanks Star2 Bill Paxton Star3 Kevin Bacon Star4 Gary Sinise No_of_Votes 269197 Gross 173,837,933 Name: 966, dtype: object
# Change the data type of "Released_Year" to int
# Panda Basic
imdb['Released_Year']=imdb['Released_Year'].apply(int)
imdb.dtypes
Poster_Link object Series_Title object Released_Year int64 Certificate object Runtime object Genre object IMDB_Rating float64 Overview object Meta_score float64 Director object Star1 object Star2 object Star3 object Star4 object No_of_Votes int64 Gross object dtype: object
# Select all movies released after (>=) 2010 and with IMDB_Rating>=8.5
# Show their title, released year, certificate, and IMDB rating.
imdb.loc[(imdb['Released_Year']>=2010)&(imdb['IMDB_Rating']>=8.5)
,['Series_Title','Released_Year','Certificate','IMDB_Rating']]
| Series_Title | Released_Year | Certificate | IMDB_Rating | |
|---|---|---|---|---|
| 8 | Inception | 2010 | UA | 8.8 |
| 18 | Hamilton | 2020 | PG-13 | 8.6 |
| 19 | Gisaengchung | 2019 | A | 8.6 |
| 20 | Soorarai Pottru | 2020 | U | 8.6 |
| 21 | Interstellar | 2014 | UA | 8.6 |
| 33 | Joker | 2019 | A | 8.5 |
| 34 | Whiplash | 2014 | A | 8.5 |
| 35 | The Intouchables | 2011 | UA | 8.5 |
# Select all movies whose genres contain 'Animation'
imdb1=imdb.dropna()
imdb1[imdb1['Genre'].str.contains('Animation')]
| Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | https://m.media-amazon.com/images/M/MV5BMjlmZm... | Sen to Chihiro no kamikakushi | 2001 | U | 125 min | Animation, Adventure, Family | 8.6 | During her family's move to the suburbs, a sul... | 96.0 | Hayao Miyazaki | Daveigh Chase | Suzanne Pleshette | Miyu Irino | Rumi Hiiragi | 651376 | 10,055,859 |
| 43 | https://m.media-amazon.com/images/M/MV5BYTYxNG... | The Lion King | 1994 | U | 88 min | Animation, Adventure, Drama | 8.5 | Lion prince Simba and his father are targeted ... | 88.0 | Roger Allers | Rob Minkoff | Matthew Broderick | Jeremy Irons | James Earl Jones | 942045 | 422,783,777 |
| 56 | https://m.media-amazon.com/images/M/MV5BODRmZD... | Kimi no na wa. | 2016 | U | 106 min | Animation, Drama, Fantasy | 8.4 | Two strangers find themselves linked in a biza... | 79.0 | Makoto Shinkai | Ryûnosuke Kamiki | Mone Kamishiraishi | Ryô Narita | Aoi Yûki | 194838 | 5,017,246 |
| 58 | https://m.media-amazon.com/images/M/MV5BMjMwND... | Spider-Man: Into the Spider-Verse | 2018 | U | 117 min | Animation, Action, Adventure | 8.4 | Teen Miles Morales becomes the Spider-Man of h... | 87.0 | Bob Persichetti | Peter Ramsey | Rodney Rothman | Shameik Moore | Jake Johnson | 375110 | 190,241,310 |
| 61 | https://m.media-amazon.com/images/M/MV5BYjQ5Nj... | Coco | 2017 | U | 105 min | Animation, Adventure, Family | 8.4 | Aspiring musician Miguel, confronted with his ... | 81.0 | Lee Unkrich | Adrian Molina | Anthony Gonzalez | Gael García Bernal | Benjamin Bratt | 384171 | 209,726,015 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 906 | https://m.media-amazon.com/images/M/MV5BMTY3Nj... | Despicable Me | 2010 | U | 95 min | Animation, Comedy, Crime | 7.6 | When a criminal mastermind uses a trio of orph... | 72.0 | Pierre Coffin | Chris Renaud | Steve Carell | Jason Segel | Russell Brand | 500851 | 251,513,985 |
| 956 | https://m.media-amazon.com/images/M/MV5BODkxNG... | Mulan | 1998 | U | 88 min | Animation, Adventure, Family | 7.6 | To save her father from death in the army, a y... | 71.0 | Tony Bancroft | Barry Cook | Ming-Na Wen | Eddie Murphy | BD Wong | 256906 | 120,620,254 |
| 971 | https://m.media-amazon.com/images/M/MV5BMTY5Nj... | Omohide poro poro | 1991 | U | 118 min | Animation, Drama, Romance | 7.6 | A twenty-seven-year-old office worker travels ... | 90.0 | Isao Takahata | Miki Imai | Toshirô Yanagiba | Yoko Honna | Mayumi Izuka | 27071 | 453,243 |
| 976 | https://m.media-amazon.com/images/M/MV5BN2JlZT... | The Little Mermaid | 1989 | U | 83 min | Animation, Family, Fantasy | 7.6 | A mermaid princess makes a Faustian bargain in... | 88.0 | Ron Clements | John Musker | Jodi Benson | Samuel E. Wright | Rene Auberjonois | 237696 | 111,543,479 |
| 992 | https://m.media-amazon.com/images/M/MV5BMjAwMT... | The Jungle Book | 1967 | U | 78 min | Animation, Adventure, Family | 7.6 | Bagheera the Panther and Baloo the Bear have a... | 65.0 | Wolfgang Reitherman | Phil Harris | Sebastian Cabot | Louis Prima | Bruce Reitherman | 166409 | 141,843,612 |
63 rows × 16 columns
# Create a new dataframe called "stars" including the following columns:
# Series_Title, Released_Year, Star1, Star2, Star3, Star4
stars = imdb.filter(['Series_Title','Released_Year'
,'Star1','Star2','Star3','Star4'])
stars
| Series_Title | Released_Year | Star1 | Star2 | Star3 | Star4 | |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler |
| 1 | The Godfather | 1972 | Marlon Brando | Al Pacino | James Caan | Diane Keaton |
| 2 | The Dark Knight | 2008 | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine |
| 3 | The Godfather: Part II | 1974 | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton |
| 4 | 12 Angry Men | 1957 | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | Breakfast at Tiffany's | 1961 | Audrey Hepburn | George Peppard | Patricia Neal | Buddy Ebsen |
| 996 | Giant | 1956 | Elizabeth Taylor | Rock Hudson | James Dean | Carroll Baker |
| 997 | From Here to Eternity | 1953 | Burt Lancaster | Montgomery Clift | Deborah Kerr | Donna Reed |
| 998 | Lifeboat | 1944 | Tallulah Bankhead | John Hodiak | Walter Slezak | William Bendix |
| 999 | The 39 Steps | 1935 | Robert Donat | Madeleine Carroll | Lucie Mannheim | Godfrey Tearle |
1000 rows × 6 columns
# Create a new dataframe called "genres" including the following columns:
# Series_Title, Released_Year, Genre.
genres = imdb.filter(['Series_Title','Released_Year','Genre'])
genres
| Series_Title | Released_Year | Genre | |
|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Drama |
| 1 | The Godfather | 1972 | Crime, Drama |
| 2 | The Dark Knight | 2008 | Action, Crime, Drama |
| 3 | The Godfather: Part II | 1974 | Crime, Drama |
| 4 | 12 Angry Men | 1957 | Crime, Drama |
| ... | ... | ... | ... |
| 995 | Breakfast at Tiffany's | 1961 | Comedy, Drama, Romance |
| 996 | Giant | 1956 | Drama, Western |
| 997 | From Here to Eternity | 1953 | Drama, Romance, War |
| 998 | Lifeboat | 1944 | Drama, War |
| 999 | The 39 Steps | 1935 | Crime, Mystery, Thriller |
1000 rows × 3 columns
# Sorting:
# Sort dataframe genres in ascending order of "Released_Year"
genres.sort_values('Released_Year')
| Series_Title | Released_Year | Genre | |
|---|---|---|---|
| 321 | Das Cabinet des Dr. Caligari | 1920 | Fantasy, Horror, Mystery |
| 127 | The Kid | 1921 | Comedy, Drama, Family |
| 568 | Nosferatu | 1922 | Fantasy, Horror |
| 194 | Sherlock Jr. | 1924 | Action, Comedy, Romance |
| 193 | The Gold Rush | 1925 | Adventure, Comedy, Drama |
| ... | ... | ... | ... |
| 20 | Soorarai Pottru | 2020 | Drama |
| 205 | Soul | 2020 | Animation, Adventure, Comedy |
| 613 | Druk | 2020 | Comedy, Drama |
| 464 | Dil Bechara | 2020 | Comedy, Drama, Romance |
| 612 | The Trial of the Chicago 7 | 2020 | Drama, History, Thriller |
1000 rows × 3 columns
# Select all movies released after (>=) 2010 and with IMDB_Rating>=8.5
# Show their title, released year, Certificate, and gross.
# Sort them in descending order of "Gross"
imdb.loc[(imdb['Released_Year']>=2010)&(imdb['IMDB_Rating']>=8.5)
,['Series_Title','Released_Year','Certificate','Gross']]
imdb.sort_values('Gross',ascending=False)
| Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 738 | https://m.media-amazon.com/images/M/MV5BOTc3Nz... | Rockstar | 2011 | UA | 159 min | Drama, Music, Musical | 7.7 | Janardhan Jakhar chases his dreams of becoming... | NaN | Imtiaz Ali | Ranbir Kapoor | Nargis Fakhri | Shammi Kapoor | Kumud Mishra | 39501 | 985,912 |
| 682 | https://m.media-amazon.com/images/M/MV5BZDRkOW... | The Color Purple | 1985 | U | 154 min | Drama | 7.8 | A black Southern woman struggles to find her i... | 78.0 | Steven Spielberg | Danny Glover | Whoopi Goldberg | Oprah Winfrey | Margaret Avery | 78321 | 98,467,863 |
| 194 | https://m.media-amazon.com/images/M/MV5BZWFhOG... | Sherlock Jr. | 1924 | Passed | 45 min | Action, Comedy, Romance | 8.2 | A film projectionist longs to be a detective, ... | NaN | Buster Keaton | Buster Keaton | Kathryn McGuire | Joe Keaton | Erwin Connelly | 41985 | 977,375 |
| 748 | https://m.media-amazon.com/images/M/MV5BOGUyZD... | The Social Network | 2010 | UA | 120 min | Biography, Drama | 7.7 | As Harvard student Mark Zuckerberg creates the... | 95.0 | David Fincher | Jesse Eisenberg | Andrew Garfield | Justin Timberlake | Rooney Mara | 624982 | 96,962,694 |
| 7 | https://m.media-amazon.com/images/M/MV5BNDE4OT... | Schindler's List | 1993 | A | 195 min | Biography, Drama, History | 8.9 | In German-occupied Poland during World War II,... | 94.0 | Steven Spielberg | Liam Neeson | Ralph Fiennes | Ben Kingsley | Caroline Goodall | 1213505 | 96,898,818 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 993 | https://m.media-amazon.com/images/M/MV5BYTE4YW... | Blowup | 1966 | A | 111 min | Drama, Mystery, Thriller | 7.6 | A fashion photographer unknowingly captures a ... | 82.0 | Michelangelo Antonioni | David Hemmings | Vanessa Redgrave | Sarah Miles | John Castle | 56513 | NaN |
| 995 | https://m.media-amazon.com/images/M/MV5BNGEwMT... | Breakfast at Tiffany's | 1961 | A | 115 min | Comedy, Drama, Romance | 7.6 | A young New York socialite becomes interested ... | 76.0 | Blake Edwards | Audrey Hepburn | George Peppard | Patricia Neal | Buddy Ebsen | 166544 | NaN |
| 996 | https://m.media-amazon.com/images/M/MV5BODk3Yj... | Giant | 1956 | G | 201 min | Drama, Western | 7.6 | Sprawling epic covering the life of a Texas ca... | 84.0 | George Stevens | Elizabeth Taylor | Rock Hudson | James Dean | Carroll Baker | 34075 | NaN |
| 998 | https://m.media-amazon.com/images/M/MV5BZTBmMj... | Lifeboat | 1944 | NaN | 97 min | Drama, War | 7.6 | Several survivors of a torpedoed merchant ship... | 78.0 | Alfred Hitchcock | Tallulah Bankhead | John Hodiak | Walter Slezak | William Bendix | 26471 | NaN |
| 999 | https://m.media-amazon.com/images/M/MV5BMTY5OD... | The 39 Steps | 1935 | NaN | 86 min | Crime, Mystery, Thriller | 7.6 | A man in London tries to help a counter-espion... | 93.0 | Alfred Hitchcock | Robert Donat | Madeleine Carroll | Lucie Mannheim | Godfrey Tearle | 51853 | NaN |
1000 rows × 16 columns
# Does the sorting result looks right to you? What's the problem?
# Numbers order are not sorting, because it is str type
# Resolve this problem of "Gross" and convert its data type to float
# Hint: You may find this webpage useful:
# https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-column-of-a-pandas-dataframe
imdb['Gross']=imdb['Gross'].str.replace(',','')
imdb['Gross']=imdb['Gross'].apply(float)
imdb.dtypes
Poster_Link object Series_Title object Released_Year int64 Certificate object Runtime object Genre object IMDB_Rating float64 Overview object Meta_score float64 Director object Star1 object Star2 object Star3 object Star4 object No_of_Votes int64 Gross float64 dtype: object
# Next, redo the sorting on Gross
# Select all movies released after (>=) 2010 and with IMDB_Rating>=8.5
# Show their title, released year, Certificate, and gross.
# Sort them in descending order of "Gross"
imdb.loc[(imdb['Released_Year']>=2010)&(imdb['IMDB_Rating']>=8.5)
,['Series_Title','Released_Year','Certificate','Gross']]
imdb.sort_values('Gross',ascending=False)
| Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 477 | https://m.media-amazon.com/images/M/MV5BOTAzOD... | Star Wars: Episode VII - The Force Awakens | 2015 | U | 138 min | Action, Adventure, Sci-Fi | 7.9 | As a new threat to the galaxy rises, Rey, a de... | 80.0 | J.J. Abrams | Daisy Ridley | John Boyega | Oscar Isaac | Domhnall Gleeson | 860823 | 936662225.0 |
| 59 | https://m.media-amazon.com/images/M/MV5BMTc5MD... | Avengers: Endgame | 2019 | UA | 181 min | Action, Adventure, Drama | 8.4 | After the devastating events of Avengers: Infi... | 78.0 | Anthony Russo | Joe Russo | Robert Downey Jr. | Chris Evans | Mark Ruffalo | 809955 | 858373000.0 |
| 623 | https://m.media-amazon.com/images/M/MV5BMTYwOT... | Avatar | 2009 | UA | 162 min | Action, Adventure, Fantasy | 7.8 | A paraplegic Marine dispatched to the moon Pan... | 83.0 | James Cameron | Sam Worthington | Zoe Saldana | Sigourney Weaver | Michelle Rodriguez | 1118998 | 760507625.0 |
| 60 | https://m.media-amazon.com/images/M/MV5BMjMxNj... | Avengers: Infinity War | 2018 | UA | 149 min | Action, Adventure, Sci-Fi | 8.4 | The Avengers and their allies must be willing ... | 68.0 | Anthony Russo | Joe Russo | Robert Downey Jr. | Chris Hemsworth | Mark Ruffalo | 834477 | 678815482.0 |
| 652 | https://m.media-amazon.com/images/M/MV5BMDdmZG... | Titanic | 1997 | UA | 194 min | Drama, Romance | 7.8 | A seventeen-year-old aristocrat falls in love ... | 75.0 | James Cameron | Leonardo DiCaprio | Kate Winslet | Billy Zane | Kathy Bates | 1046089 | 659325379.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 993 | https://m.media-amazon.com/images/M/MV5BYTE4YW... | Blowup | 1966 | A | 111 min | Drama, Mystery, Thriller | 7.6 | A fashion photographer unknowingly captures a ... | 82.0 | Michelangelo Antonioni | David Hemmings | Vanessa Redgrave | Sarah Miles | John Castle | 56513 | NaN |
| 995 | https://m.media-amazon.com/images/M/MV5BNGEwMT... | Breakfast at Tiffany's | 1961 | A | 115 min | Comedy, Drama, Romance | 7.6 | A young New York socialite becomes interested ... | 76.0 | Blake Edwards | Audrey Hepburn | George Peppard | Patricia Neal | Buddy Ebsen | 166544 | NaN |
| 996 | https://m.media-amazon.com/images/M/MV5BODk3Yj... | Giant | 1956 | G | 201 min | Drama, Western | 7.6 | Sprawling epic covering the life of a Texas ca... | 84.0 | George Stevens | Elizabeth Taylor | Rock Hudson | James Dean | Carroll Baker | 34075 | NaN |
| 998 | https://m.media-amazon.com/images/M/MV5BZTBmMj... | Lifeboat | 1944 | NaN | 97 min | Drama, War | 7.6 | Several survivors of a torpedoed merchant ship... | 78.0 | Alfred Hitchcock | Tallulah Bankhead | John Hodiak | Walter Slezak | William Bendix | 26471 | NaN |
| 999 | https://m.media-amazon.com/images/M/MV5BMTY5OD... | The 39 Steps | 1935 | NaN | 86 min | Crime, Mystery, Thriller | 7.6 | A man in London tries to help a counter-espion... | 93.0 | Alfred Hitchcock | Robert Donat | Madeleine Carroll | Lucie Mannheim | Godfrey Tearle | 51853 | NaN |
1000 rows × 16 columns
# Add a new column "Runtime_min" by removing the substring ' min" in "Runtime"
# Set its data type as int
# Hint: https://stackoverflow.com/questions/36505847/substring-of-an-entire-column-in-pandas-dataframe
imdb['Runtime_min']=imdb['Runtime'].str.replace('min','')
imdb['Runtime_min']=imdb['Runtime_min'].apply(int)
imdb.dtypes
Poster_Link object Series_Title object Released_Year int64 Certificate object Runtime object Genre object IMDB_Rating float64 Overview object Meta_score float64 Director object Star1 object Star2 object Star3 object Star4 object No_of_Votes int64 Gross float64 Runtime_min int64 dtype: object
# Add a new column "Age_Year" by expression: 2021 - Released_Year
imdb['Age_Year'] = 2021 - imdb.Released_Year
imdb.head(3)
| Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross | Runtime_min | Age_Year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469.0 | 142 | 27 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411.0 | 175 | 49 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444.0 | 152 | 13 |
# Add a new column "Decade" with values as 1980, 1990, 2000, 2010, 2020, etc.
imdb['Decade']=imdb.Released_Year//10*10
imdb.head(10)
| Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross | Runtime_min | Age_Year | Decade | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469.0 | 142 | 27 | 1990 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411.0 | 175 | 49 | 1970 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444.0 | 152 | 13 | 2000 |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57300000.0 | 202 | 47 | 1970 |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4360000.0 | 96 | 64 | 1950 |
| 5 | https://m.media-amazon.com/images/M/MV5BNzA5ZD... | The Lord of the Rings: The Return of the King | 2003 | U | 201 min | Action, Adventure, Drama | 8.9 | Gandalf and Aragorn lead the World of Men agai... | 94.0 | Peter Jackson | Elijah Wood | Viggo Mortensen | Ian McKellen | Orlando Bloom | 1642758 | 377845905.0 | 201 | 18 | 2000 |
| 6 | https://m.media-amazon.com/images/M/MV5BNGNhMD... | Pulp Fiction | 1994 | A | 154 min | Crime, Drama | 8.9 | The lives of two mob hitmen, a boxer, a gangst... | 94.0 | Quentin Tarantino | John Travolta | Uma Thurman | Samuel L. Jackson | Bruce Willis | 1826188 | 107928762.0 | 154 | 27 | 1990 |
| 7 | https://m.media-amazon.com/images/M/MV5BNDE4OT... | Schindler's List | 1993 | A | 195 min | Biography, Drama, History | 8.9 | In German-occupied Poland during World War II,... | 94.0 | Steven Spielberg | Liam Neeson | Ralph Fiennes | Ben Kingsley | Caroline Goodall | 1213505 | 96898818.0 | 195 | 28 | 1990 |
| 8 | https://m.media-amazon.com/images/M/MV5BMjAxMz... | Inception | 2010 | UA | 148 min | Action, Adventure, Sci-Fi | 8.8 | A thief who steals corporate secrets through t... | 74.0 | Christopher Nolan | Leonardo DiCaprio | Joseph Gordon-Levitt | Elliot Page | Ken Watanabe | 2067042 | 292576195.0 | 148 | 11 | 2010 |
| 9 | https://m.media-amazon.com/images/M/MV5BMmEzNT... | Fight Club | 1999 | A | 139 min | Drama | 8.8 | An insomniac office worker and a devil-may-car... | 66.0 | David Fincher | Brad Pitt | Edward Norton | Meat Loaf | Zach Grenier | 1854740 | 37030102.0 | 139 | 22 | 1990 |
# Total "Gross" of all top 1000 movies
(
imdb
.groupby('Series_Title')
.agg({'Gross':'sum'})
.reset_index()
)
| Series_Title | Gross | |
|---|---|---|
| 0 | (500) Days of Summer | 32391374.0 |
| 1 | 12 Angry Men | 4360000.0 |
| 2 | 12 Years a Slave | 56671993.0 |
| 3 | 1917 | 159227644.0 |
| 4 | 2001: A Space Odyssey | 56954992.0 |
| ... | ... | ... |
| 994 | Zootopia | 341268248.0 |
| 995 | Zulu | 0.0 |
| 996 | Zwartboek | 4398392.0 |
| 997 | À bout de souffle | 336705.0 |
| 998 | Ôkami kodomo no Ame to Yuki | 0.0 |
999 rows × 2 columns
# Average "No_of_Votes" of all movies
(
imdb
.groupby('Series_Title')
.agg({'No_of_Votes':'mean'})
.reset_index()
)
| Series_Title | No_of_Votes | |
|---|---|---|
| 0 | (500) Days of Summer | 472242.0 |
| 1 | 12 Angry Men | 689845.0 |
| 2 | 12 Years a Slave | 640533.0 |
| 3 | 1917 | 425844.0 |
| 4 | 2001: A Space Odyssey | 603517.0 |
| ... | ... | ... |
| 994 | Zootopia | 434143.0 |
| 995 | Zulu | 35999.0 |
| 996 | Zwartboek | 72643.0 |
| 997 | À bout de souffle | 73251.0 |
| 998 | Ôkami kodomo no Ame to Yuki | 38803.0 |
999 rows × 2 columns
# Count movies in each decade (e.g., ..., 1980, 1990, 2000, 2010, 2020)
# Sort decades by the number of movies in descending order
(
imdb
.groupby('Decade')
.agg({'Series_Title':'count'})
.reset_index()
.sort_values('Series_Title', ascending=False)
)
| Decade | Series_Title | |
|---|---|---|
| 9 | 2010 | 242 |
| 8 | 2000 | 237 |
| 7 | 1990 | 151 |
| 6 | 1980 | 89 |
| 5 | 1970 | 76 |
| 4 | 1960 | 73 |
| 3 | 1950 | 56 |
| 2 | 1940 | 35 |
| 1 | 1930 | 24 |
| 0 | 1920 | 11 |
| 10 | 2020 | 6 |
# Count movies by different directors.
# Show the top 10 directors with the most movies in this list
(
imdb
.groupby('Director')
.agg({'Series_Title':'count'})
.reset_index()
.sort_values('Series_Title',ascending=False)
.head(10)
)
| Director | Series_Title | |
|---|---|---|
| 22 | Alfred Hitchcock | 14 |
| 470 | Steven Spielberg | 13 |
| 179 | Hayao Miyazaki | 11 |
| 313 | Martin Scorsese | 10 |
| 9 | Akira Kurosawa | 10 |
| 463 | Stanley Kubrick | 9 |
| 532 | Woody Allen | 9 |
| 49 | Billy Wilder | 9 |
| 391 | Quentin Tarantino | 8 |
| 83 | Christopher Nolan | 8 |
# For each director, calculate the number of movies, average IMDB_Rating, and total gross.
# Ranked in descending order of total gross
# Show the top 10 directors with the most gross
(
imdb
.groupby('Director')
.agg({'Series_Title':'count'
,'IMDB_Rating':'mean'
,'Gross':'sum'
})
.reset_index()
.sort_values('Gross', ascending=False)
.head(10)
)
| Director | Series_Title | IMDB_Rating | Gross | |
|---|---|---|---|---|
| 470 | Steven Spielberg | 13 | 8.030769 | 2.478133e+09 |
| 36 | Anthony Russo | 4 | 8.075000 | 2.205039e+09 |
| 83 | Christopher Nolan | 8 | 8.462500 | 1.937454e+09 |
| 202 | James Cameron | 5 | 8.080000 | 1.748237e+09 |
| 383 | Peter Jackson | 5 | 8.400000 | 1.597312e+09 |
| 195 | J.J. Abrams | 3 | 7.833333 | 1.423171e+09 |
| 58 | Brad Bird | 4 | 7.900000 | 1.099628e+09 |
| 426 | Robert Zemeckis | 5 | 8.120000 | 1.049446e+09 |
| 107 | David Yates | 3 | 7.800000 | 9.789537e+08 |
| 380 | Pete Docter | 4 | 8.125000 | 9.393821e+08 |
# Group movies by decade and director
# In each group (i.e., for each decade and each director),
# calculate the number of movies and average IMDB rating.
# Sort in descending order of movie count
(
imdb
.groupby(['Director','Decade'])
.agg({'Series_Title':'count'
,'IMDB_Rating':'mean'
})
.reset_index()
.sort_values('Series_Title', ascending=False)
.head(10)
)
| Director | Decade | Series_Title | IMDB_Rating | |
|---|---|---|---|---|
| 71 | Billy Wilder | 1950 | 6 | 8.133333 |
| 9 | Akira Kurosawa | 1950 | 5 | 8.260000 |
| 31 | Alfred Hitchcock | 1940 | 5 | 7.880000 |
| 120 | Clint Eastwood | 2000 | 5 | 7.940000 |
| 155 | Denis Villeneuve | 2010 | 5 | 7.980000 |
| 32 | Alfred Hitchcock | 1950 | 5 | 8.220000 |
| 645 | Steven Spielberg | 1980 | 5 | 7.980000 |
| 116 | Christopher Nolan | 2000 | 4 | 8.525000 |
| 611 | Sergio Leone | 1960 | 4 | 8.400000 |
| 729 | Woody Allen | 1980 | 4 | 7.800000 |
# Bonus Question 1:
# Find the top 3 highest rated movie of each year since 2010.
# Show their released year, ranking, title, and IMDB_Rating.
# Sort them in descending order of year and ascending order of ranking
# Bonus Question 2:
# Find all directors whose movies appeared in at least five different decades.
# Your result should include: director, decade, and the number of movies in the decade.
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# the following line enables the static images of the plot embedded in the notebook, default option
%matplotlib inline
# set different styles: https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html
plt.style.use('seaborn') # we use seaborn style, which looks better than the default one
#plt.style.use('default')
Matplotlib has a traditional MATLAB-style interface and a more powerful object-oriented (OO) interface. Please refer to https://matplotlib.org/stable/tutorials/introductory/lifecycle.html if you want to learn more about the differences.
The main thing to remember is:
# how to display an image
from IPython.display import Image
Image('../img/matplotlib-components.png')
fig = plt.figure()
ax = plt.axes()
# numpy.linspace(start, stop, num=50): Return evenly spaced numbers over a specified interval.
x = np.linspace(0, 10, 50) # return 50 numbers between 0 and 10
# common functions: http://mathonweb.com/help_ebook/html/functions_4.htm
ax.plot(x, 2*x+3) # linear function, such as f(x) = 2*x + 3
ax.plot(x, x**2-5*x+3) # quadratic function, such as f(x) = x^2 - 5x + 3
fig.savefig('../img/first-simple-figure.png') # this is how you save the plot as an image
plt.plot(x, 2*x+3) # linear function, such as f(x) = 2*x + 3
plt.plot(x, x**2-5*x+3) # quadratic function, such as f(x) = x^2 - 5x + 3
[<matplotlib.lines.Line2D at 0x1478b91d8e0>]
# figsize=(width, height) specifies figure size in inches, default to (6.4, 4.8)
fig = plt.figure(figsize=(5,5))
ax = plt.axes()
ax.plot(x, x**3) # power function, such as f(x) = x^3
ax.plot(x, x**3 - 8*x**2 + 5) # polynomial function, such as f(x) = x^3 - 8x^2 + 5
ax.plot(x, 2**x) # Exponential functions, such as f(x) = 2^x
[<matplotlib.lines.Line2D at 0x1478d7abc70>]
plt.plot(x, x**3) # power function, such as f(x) = x^3
plt.plot(x, x**3 - 8*x**2 + 5) # polynomial function, such as f(x) = x^3 - 8x^2 + 5
plt.plot(x, 2**x) # Exponential functions, such as f(x) = 2^x
[<matplotlib.lines.Line2D at 0x1478d81d640>]
Create a line plot for the Sigmoid function having a characteristic "S"-shaped curve or sigmoid curve.
https://en.wikipedia.org/wiki/Sigmoid_function
$S(x)=\frac{1}{1+e^{-x}}=\frac{e^{x}}{e^{x}+1}$
# Add your code here:
fig = plt.figure(figsize=(5,5))
ax = plt.axes()
x = np.linspace (-10,10,50)
ax.plot(x,(1/(1+np.exp(-x))))
[<matplotlib.lines.Line2D at 0x1478d87ac40>]
Subplots are groups of smaller axes that can exist together within a single figure for each comparison.
# subplot(# of rows, # of cols, index): a 2x3 grid with 6 subplots is illustrated below:
Image('../img/subplot.png')
fig = plt.figure(figsize=(15,15))
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax1.plot(x, x*2)
ax2.plot(x, x**2)
[<matplotlib.lines.Line2D at 0x1478d924a30>]
fig, ax = plt.subplots(2, 2, figsize=(15,15))
ax[0, 0].plot(x, x*2)
ax[1, 1].plot(x, x**2)
[<matplotlib.lines.Line2D at 0x1478d9a51c0>]
#subplot(ax): ax specifies the number of subplots and return a list of axes
fig, ax = plt.subplots(2)
ax[0].plot(x, np.sin(x))
ax[1].plot(x, np.cos(x))
[<matplotlib.lines.Line2D at 0x1478ebde490>]
# Plot the Normal Distribution
# you can include Latex inline by enclosing latex text with $, such as $\mu$
from scipy.stats import norm
fig, ax = plt.subplots(figsize=(10,10))
x = np.linspace(-10, 10, 1000)
mu = 0
sigma = 2
dist = norm(mu, sigma)
ax.plot(x, dist.pdf(x), label=f'$\mu={mu}, \sigma={sigma}$')
ax.set_title('Normal Distribution', fontsize=20)
ax.set_xlabel('$x$')
ax.set_ylabel('$p(x|\mu,\sigma)$')
ax.legend()
<matplotlib.legend.Legend at 0x1478b8d40d0>
Create a plot that overlays four normal distributions:
# Overlay multiple normal distributions in the same plot
# you can include Latex inline by enclosing latex text with $, such as $\mu$
fig = plt.figure(figsize=(10,10))
ax = plt.axes()
# Add your code here:
x = np.linspace(-10, 10, 1000)
#Line 1
mu = 0
sigma = 1
dist = norm(mu, sigma)
ax.plot(x, dist.pdf(x), label=f'$\mu={mu}, \sigma={sigma}$')
#Line 2
mu = 0
sigma = 2
dist = norm(mu, sigma)
ax.plot(x, dist.pdf(x), label=f'$\mu={mu}, \sigma={sigma}$')
#Line 3
mu = 0
sigma = 3
dist = norm(mu, sigma)
ax.plot(x, dist.pdf(x), label=f'$\mu={mu}, \sigma={sigma}$')
#Line 4
mu = 2
sigma = 1
dist = norm(mu, sigma)
ax.plot(x, dist.pdf(x), label=f'$\mu={mu}, \sigma={sigma}$')
ax.set_title('Normal Distribution', fontsize=20)
ax.set_xlabel('$x$')
ax.set_ylabel('$p(x|\mu,\sigma)$')
ax.legend()
<matplotlib.legend.Legend at 0x1478f4aa0d0>
# Put these four normal distrubitons in four subplots in a 2x2 figure
# Add your code here:
fig, ax = plt.subplots(2, 2)
x = np.linspace(-10, 10, 1000)
ax[0,0].plot(x,norm.pdf(x,0,1))
ax[0,1].plot(x,norm.pdf(x,0,2))
ax[1,0].plot(x,norm.pdf(x,0,3))
ax[1,1].plot(x,norm.pdf(x,2,1))
[<matplotlib.lines.Line2D at 0x1478f245e80>]
Image('../img/matplotlib-figure-anatomy.png')
# figsize=(width, height) specifies figure size in inches, default to (6.4, 4.8)
fig, ax = plt.subplots(figsize=(10,10))
x = np.linspace(0, 10, 50) # return 50 numbers between 0 and 10
# set the title and axes labels
ax.set_title('Figure for Demonstration', fontsize=20)
ax.set_xlabel('The value of x')
ax.set_ylabel('The value of y')
# set axes limits
ax.set_xlim(-3, 12)
ax.set_ylim(-200, 1200)
# set plot lable and show legend
ax.plot(x, x**3, label='y=x^3')
ax.plot(x, x**3 - 8*x**2 + 5, label='y=x^3-8x^2+5')
ax.plot(x, 2**x, label='y=2^x')
ax.legend()
<matplotlib.legend.Legend at 0x1478f962160>
fig = plt.figure(figsize=(10, 10))
ax = plt.axes()
# line style
ax.plot(x, x, linestyle='solid')
ax.plot(x, x + 1, linestyle='dashed')
ax.plot(x, x + 2, linestyle='dashdot')
ax.plot(x, x + 3, linestyle='dotted');
# For short, you can use the following codes:
ax.plot(x, x + 5, linestyle='-') # solid
ax.plot(x, x + 6, linestyle='--') # dashed
ax.plot(x, x + 7, linestyle='-.') # dashdot
ax.plot(x, x + 8, linestyle=':'); # dotted
# color
ax.plot(x, x + 10, color='blue') # specify color by name
ax.plot(x, x + 11, color='g') # short color code (rgbcmyk)
ax.plot(x, x + 12, color='0.75') # Grayscale between 0 and 1
ax.plot(x, x + 13, color='#FFDD44') # Hex code (RRGGBB from 00 to FF)
ax.plot(x, x + 14, color=(1.0,0.2,0.3)) # RGB tuple, values 0 to 1
ax.plot(x, x + 15, color='chartreuse'); # all HTML color names supported
# combine line style and color
ax.plot(x, x + 17, '-g') # solid green
ax.plot(x, x + 18, '--c') # dashed cyan
ax.plot(x, x + 19, '-.k') # dashdot black
ax.plot(x, x + 20, ':r'); # dotted red
Image('../img/filled-markers.png')
fig = plt.figure(figsize=(10,10))
ax1 = plt.axes()
ax1.plot(x, x**2, 'o', label='y=2^x')
ax1.plot(x, x**3, '-s', label='y=2^x')
ax1.plot(x, x**2+200, '--D', label='y=2^x')
ax1.legend() # show legend
<matplotlib.legend.Legend at 0x1478fa49610>
Plot the following function of x and y:
Paint it in red and use big diamond markers (D).
# Add your code here:
fig = plt.figure(figsize=(10,10))
ax = plt.axes()
t = np.linspace(-10, 10, 200) # 200 numbers between -10 and 10
x = 16*(np.sin(t))**3
y = (13*np.cos(t))-(5*np.cos(2*t))-(2*np.cos(3*t))-(np.cos(4*t))
ax.plot(x,y,'-D',color='r')
[<matplotlib.lines.Line2D at 0x1478ffcdfa0>]
rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)
plt.scatter(x, y, c=colors, s=sizes, alpha=0.7,
cmap='viridis')
plt.colorbar(); # show color scale
Create a scatter plot of two series, $x$ and $y$, where:
# Add your code here:
x = np.linspace(0,100,100)
e = np.random.normal(0,5)
y = 10 - 2 * x + x ** 2 + e
plt.scatter(x, y, alpha=0.2)
<matplotlib.collections.PathCollection at 0x1478fe58af0>
data = np.random.randn(1000)
plt.hist(data)
(array([ 8., 41., 123., 233., 290., 203., 78., 19., 4., 1.]),
array([-3.10920911, -2.38494265, -1.6606762 , -0.93640974, -0.21214329,
0.51212317, 1.23638963, 1.96065608, 2.68492254, 3.40918899,
4.13345545]),
<BarContainer object of 10 artists>)
plt.hist(data, bins=30, alpha=0.5,
histtype='stepfilled', color='steelblue',
edgecolor='none');
x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)
kwargs = dict(histtype='stepfilled', alpha=0.3, bins=40)
plt.hist(x1, **kwargs, label='x1')
plt.hist(x2, **kwargs, label='x2')
plt.hist(x3, **kwargs, label='x3')
plt.legend()
<matplotlib.legend.Legend at 0x1478fee65e0>
Create a figure that overlays four histograms that shows one of the following probability distributions:
Hint:
# Add your code here:
x1 = np.random.normal(0,1, 1000)
x2 = np.random.normal(0,2, 1000)
x3 = np.random.uniform(-2,2, 1000)
x4 = np.random.exponential(scale=0.5,size=1000)
kwargs = dict(histtype='stepfilled', alpha=0.3, bins=40)
plt.hist(x1, **kwargs, label='x1')
plt.hist(x2, **kwargs, label='x2')
plt.hist(x3, **kwargs, label='x3')
plt.hist(x4, **kwargs, label='x4')
plt.legend()
<matplotlib.legend.Legend at 0x14790315820>
# Put these four histograms in four subplots in a 2x2 figure
x1 = np.random.normal(0,1, 1000)
x2 = np.random.normal(0,2, 1000)
x3 = np.random.uniform(-2,2, 1000)
x4 = np.random.exponential(scale=0.5,size=1000)
kwargs = dict(histtype='stepfilled', alpha=0.3, bins=40)
fig, axs = plt.subplots(2,2)
axs[0,0].hist(x1, **kwargs, label='x1')
axs[0,1].hist(x2, **kwargs, label='x2')
axs[1,0].hist(x3, **kwargs, label='x3')
axs[1,1].hist(x4, **kwargs, label='x4')
plt.legend()
<matplotlib.legend.Legend at 0x1478fef34c0>
A picture is woth a thousand words. Data visualization can help us uncover relationships and patterns that are hidden in our data.
First, we will use graphs to answer some car-related question: Do cars with big engines use more fuel than cars with small engines? What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear?
import pandas as pd
mpg = pd.read_csv('../data/mpg.csv', header=0)
mpg.head()
| manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| 1 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| 2 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
| 3 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
| 4 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
mpg.shape
(234, 11)
This dataframe contains 234 rows and 11 variables:
# basic shape, data type, null values
mpg.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 234 entries, 0 to 233 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 manufacturer 234 non-null object 1 model 234 non-null object 2 displ 234 non-null float64 3 year 234 non-null int64 4 cyl 234 non-null int64 5 trans 234 non-null object 6 drv 234 non-null object 7 cty 234 non-null int64 8 hwy 234 non-null int64 9 fl 234 non-null object 10 class 234 non-null object dtypes: float64(1), int64(4), object(6) memory usage: 20.2+ KB
mpg.describe()
| displ | year | cyl | cty | hwy | |
|---|---|---|---|---|---|
| count | 234.000000 | 234.000000 | 234.000000 | 234.000000 | 234.000000 |
| mean | 3.471795 | 2003.500000 | 5.888889 | 16.858974 | 23.440171 |
| std | 1.291959 | 4.509646 | 1.611534 | 4.255946 | 5.954643 |
| min | 1.600000 | 1999.000000 | 4.000000 | 9.000000 | 12.000000 |
| 25% | 2.400000 | 1999.000000 | 4.000000 | 14.000000 | 18.000000 |
| 50% | 3.300000 | 2003.500000 | 6.000000 | 17.000000 | 24.000000 |
| 75% | 4.600000 | 2008.000000 | 8.000000 | 19.000000 | 27.000000 |
| max | 7.000000 | 2008.000000 | 8.000000 | 35.000000 | 44.000000 |
mpg.describe(include='all')
| manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 234 | 234 | 234.000000 | 234.000000 | 234.000000 | 234 | 234 | 234.000000 | 234.000000 | 234 | 234 |
| unique | 15 | 38 | NaN | NaN | NaN | 10 | 3 | NaN | NaN | 5 | 7 |
| top | dodge | caravan 2wd | NaN | NaN | NaN | auto(l4) | f | NaN | NaN | r | suv |
| freq | 37 | 11 | NaN | NaN | NaN | 83 | 106 | NaN | NaN | 168 | 62 |
| mean | NaN | NaN | 3.471795 | 2003.500000 | 5.888889 | NaN | NaN | 16.858974 | 23.440171 | NaN | NaN |
| std | NaN | NaN | 1.291959 | 4.509646 | 1.611534 | NaN | NaN | 4.255946 | 5.954643 | NaN | NaN |
| min | NaN | NaN | 1.600000 | 1999.000000 | 4.000000 | NaN | NaN | 9.000000 | 12.000000 | NaN | NaN |
| 25% | NaN | NaN | 2.400000 | 1999.000000 | 4.000000 | NaN | NaN | 14.000000 | 18.000000 | NaN | NaN |
| 50% | NaN | NaN | 3.300000 | 2003.500000 | 6.000000 | NaN | NaN | 17.000000 | 24.000000 | NaN | NaN |
| 75% | NaN | NaN | 4.600000 | 2008.000000 | 8.000000 | NaN | NaN | 19.000000 | 27.000000 | NaN | NaN |
| max | NaN | NaN | 7.000000 | 2008.000000 | 8.000000 | NaN | NaN | 35.000000 | 44.000000 | NaN | NaN |
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn')
In the mpg dataset, the two main variables of interest are engine size (displ) and fuel efficiency (hwy). They are both continuous variables. We can use a scatterplot to show their relationship.
Scatter plot is often used for correlation analysis between different features. Correlation coefficient is between -1 and 1, representing negative and positive correlations. 0 means there is no liner correlation. Correlation is said to be linear if the ratio of change is constant, otherwise is non-linear.
plt.scatter(x='displ',
y='hwy',
data=mpg)
plt.xlabel('displ')
plt.ylabel('hwy')
Text(0, 0.5, 'hwy')
# A pandas dataframe can reference the matplotlib API
mpg.plot.scatter(x='displ', y='hwy')
<AxesSubplot:xlabel='displ', ylabel='hwy'>
# Change the color
mpg.plot.scatter(x='displ', y='hwy', c='red')
<AxesSubplot:xlabel='displ', ylabel='hwy'>
In the mpg dataset, there are other variables. How do some of these variables affect the relationship between engine size (displ) and fuel efficiency (hwy)?
For instance, we can add a third variable, like class, to a two dimensional scatterplot to indicate a certain property of objects by color, size, or shape of points.
This is doable in maplotlib but a package named "seaborn" does this much more easily.
import seaborn as sns
g = sns.scatterplot(data=mpg, x='displ', y='hwy', hue='class')
g = sns.scatterplot(data=mpg, x='displ', y='hwy', hue='class')
g.legend(loc='right', bbox_to_anchor=(1.35, 0.75), ncol=1)
<matplotlib.legend.Legend at 0x2d5eac917f0>
# Use size to represent different classes of cars.
g = sns.scatterplot(data=mpg, x='displ', y='hwy', size='class')
g.legend(loc='right', bbox_to_anchor=(1.35, 0.75), ncol=1)
<matplotlib.legend.Legend at 0x2d5ecf03d00>
# Use style to represent different classes of cars.
g = sns.scatterplot(data=mpg, x='displ', y='hwy', style='class')
g.legend(loc='right', bbox_to_anchor=(1.35, 0.75), ncol=1)
<matplotlib.legend.Legend at 0x2d5ecf9b4c0>
# Map a continuous variable to color or size.
g = sns.scatterplot(data=mpg, x='displ', y='hwy', hue='cty')
g.legend(loc='right', bbox_to_anchor=(1.35, 0.75), ncol=1)
<matplotlib.legend.Legend at 0x2d5ed036d00>
g = sns.scatterplot(data=mpg, x='displ', y='hwy', size='cty')
g.legend(loc='right', bbox_to_anchor=(1.35, 0.75), ncol=1)
<matplotlib.legend.Legend at 0x2d5ed0c00d0>
# What happens if you map the same variable to multiple aesthetics (e.g., color and size)?
g = sns.scatterplot(data=mpg, x='displ', y='hwy', hue='class', size='class')
g.legend(loc='right', bbox_to_anchor=(1.35, 0.75), ncol=1)
<matplotlib.legend.Legend at 0x2d5ed1687c0>
We can create different types of plots for data visualization.
Can we create a line chart to show the relationship between displ and hwy?
# Try this
plt.plot(mpg.displ, mpg.hwy)
[<matplotlib.lines.Line2D at 0x2d5ee2244f0>]
# Use this instead
sns.lineplot(data=mpg, x='displ', y='hwy')
<AxesSubplot:xlabel='displ', ylabel='hwy'>
# Fit a trendline (linear with order=1)
sns.regplot(data=mpg, x='displ', y='hwy', order=1)
<AxesSubplot:xlabel='displ', ylabel='hwy'>
# Fit a non-linear trendline with order=2)
sns.regplot(data=mpg, x='displ', y='hwy', order=2)
<AxesSubplot:xlabel='displ', ylabel='hwy'>
# Add a new variable into the line chart as color
sns.lineplot(data=mpg, x='displ', y='hwy',hue='drv')
<AxesSubplot:xlabel='displ', ylabel='hwy'>
# Add a new variable into the line chart as line style
sns.lineplot(data=mpg, x='displ', y='hwy',style='drv')
<AxesSubplot:xlabel='displ', ylabel='hwy'>
# Overlay multiple plots in one chart
sns.scatterplot(data=mpg, x='displ', y='hwy', hue='class')
sns.lineplot(data=mpg, x='displ', y='hwy')
<AxesSubplot:xlabel='displ', ylabel='hwy'>
sns.scatterplot(data=mpg, x='displ', y='hwy', hue='class')
sns.regplot(data=mpg, x='displ', y='hwy', order=2, scatter=False) # Hide the scatter points
<AxesSubplot:xlabel='displ', ylabel='hwy'>
Bar charts seem simple, but they are interesting because they reveal something subtle about plots.
The diamonds dataset contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.
dmds = sns.load_dataset('diamonds')
dmds.head()
| carat | cut | color | clarity | depth | table | price | x | y | z | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
dmds.describe(include='all')
| carat | cut | color | clarity | depth | table | price | x | y | z | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 53940.000000 | 53940 | 53940 | 53940 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 |
| unique | NaN | 5 | 7 | 8 | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | Ideal | G | SI1 | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | 21551 | 11292 | 13065 | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 0.797940 | NaN | NaN | NaN | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
| std | 0.474011 | NaN | NaN | NaN | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
| min | 0.200000 | NaN | NaN | NaN | 43.000000 | 43.000000 | 326.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.400000 | NaN | NaN | NaN | 61.000000 | 56.000000 | 950.000000 | 4.710000 | 4.720000 | 2.910000 |
| 50% | 0.700000 | NaN | NaN | NaN | 61.800000 | 57.000000 | 2401.000000 | 5.700000 | 5.710000 | 3.530000 |
| 75% | 1.040000 | NaN | NaN | NaN | 62.500000 | 59.000000 | 5324.250000 | 6.540000 | 6.540000 | 4.040000 |
| max | 5.010000 | NaN | NaN | NaN | 79.000000 | 95.000000 | 18823.000000 | 10.740000 | 58.900000 | 31.800000 |
The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut.
dmds.cut.value_counts(ascending=False).plot(kind='bar')
<AxesSubplot:>
sns.countplot(data=dmds, x='cut')
<AxesSubplot:xlabel='cut', ylabel='count'>
On the x-axis, the chart displays cut, a variable from diamonds. On the y-axis, it displays count, but count is not a variable in diamonds! Where does count come from?
Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values (i.e., stats) to plot.
# Sort the values in ascending order
order = dmds['cut'].value_counts(ascending=True).index
sns.countplot(data=dmds, x='cut',order=order)
plt.show()
# Sort the values in alphabetical order of 'cut'
cut_list = dmds.cut.unique().tolist()
cut_sorted = sorted(cut_list)
sns.countplot(data=dmds, x='cut',order=cut_sorted)
plt.show()
# Add a new variable into the bar chart as color
sns.countplot(data=dmds, x='cut', hue='clarity')
plt.show()
# You can also flip the chart by putting 'clarity' on y axis.
sns.countplot(data=dmds, y='cut', hue='clarity')
<AxesSubplot:xlabel='count', ylabel='cut'>
Similarly, we can use histograms to check the followings:
sns.histplot(data=dmds, x='cut')
plt.show()
# In histplot(), you can choose to show the probability rather than count.
sns.histplot(data=dmds, x='cut', stat='probability')
plt.show()
# Add another variable to histplot
sns.histplot(data=dmds, x='cut', hue='clarity', stat='probability', multiple='dodge')
<AxesSubplot:xlabel='cut', ylabel='Probability'>
# Stack the bars
sns.histplot(data=dmds, x='cut', hue='clarity',multiple='stack')
<AxesSubplot:xlabel='cut', ylabel='Count'>
sns.histplot(data=dmds, x='price')
<AxesSubplot:xlabel='price', ylabel='Count'>
sns.histplot(data=dmds, x='price', bins=50)
<AxesSubplot:xlabel='price', ylabel='Count'>
sns.histplot(data=dmds, x='price', bins=20, hue='cut', multiple="stack")
plt.show()
sns.histplot(data=dmds, x='price', bins=20, hue='cut', stat="probability", multiple="dodge", common_norm=False)
plt.show()
adapted from: https://en.wikipedia.org/wiki/Box_plot:
A boxplot displays the dataset based on a five-number summary:
IQR is used to determine outliers, which are points that are either greater than Q3+1.5IQR or less than Q1-1.5IQR.
dmds['price'].plot.box()
<AxesSubplot:>
sns.boxplot(data=dmds, y='price')
<AxesSubplot:ylabel='price'>
sns.boxplot(data=dmds, x='cut', y='price')
<AxesSubplot:xlabel='cut', ylabel='price'>
sns.boxplot(data=dmds, x='cut', y='depth')
<AxesSubplot:xlabel='cut', ylabel='depth'>
# We have created a scatterplot of two numerical variables.
sns.scatterplot(data=mpg, x='displ', y='hwy')
<AxesSubplot:xlabel='displ', ylabel='hwy'>
# Now, let's try creating a scatterplot of two categorical variables.
sns.scatterplot(data=mpg, x='drv', y='class')
<AxesSubplot:xlabel='drv', ylabel='class'>
Why are there so few points? Overlapping. This does not show the "density" of data points.
Consider using a stripplot() when at least one variable is categorical.
sns.stripplot(data=mpg, x='drv', y='class')
<AxesSubplot:xlabel='drv', ylabel='class'>
# You can use this for plotting a categorical and a numerical variable, too.
sns.stripplot(data=mpg, x='class', y='hwy')
<AxesSubplot:xlabel='class', ylabel='hwy'>
# What happens if you map an aesthetic to something other than a variable name?
# In the scatter plot, Differentiate datapoints (cars) with displ < 5 with a different color.
# Create a new series with boolean values
displ5 = mpg['displ']<5
#displ5
g = sns.scatterplot(data=mpg, x='cyl', y='hwy', hue=displ5)
Not so easy with basic functions. Need to perform some aggregation functions first.
dmds.cut.value_counts().plot(kind='pie')
<AxesSubplot:ylabel='cut'>
mpg['class'].value_counts().plot(kind='pie')
<AxesSubplot:ylabel='class'>
In the next few assignments, you will be working with this data set of IMDB top 1000 movies.
Source: https://www.kaggle.com/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows
import pandas as pd
import numpy as np
# Read the data file "imdb_top_1000.csv" to a dataframe named "imdb"
imdb = pd.read_csv('../data/imdb_top_1000.csv', header=0)
imdb.head()
| Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28,341,469 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134,966,411 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534,858,444 |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57,300,000 |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4,360,000 |
# Describe the dataframe using the info() method.
imdb.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1000 entries, 0 to 999 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Poster_Link 1000 non-null object 1 Series_Title 1000 non-null object 2 Released_Year 1000 non-null object 3 Certificate 899 non-null object 4 Runtime 1000 non-null object 5 Genre 1000 non-null object 6 IMDB_Rating 1000 non-null float64 7 Overview 1000 non-null object 8 Meta_score 843 non-null float64 9 Director 1000 non-null object 10 Star1 1000 non-null object 11 Star2 1000 non-null object 12 Star3 1000 non-null object 13 Star4 1000 non-null object 14 No_of_Votes 1000 non-null int64 15 Gross 831 non-null object dtypes: float64(2), int64(1), object(13) memory usage: 125.1+ KB
# List all the column names:
imdb.columns
Index(['Poster_Link', 'Series_Title', 'Released_Year', 'Certificate',
'Runtime', 'Genre', 'IMDB_Rating', 'Overview', 'Meta_score', 'Director',
'Star1', 'Star2', 'Star3', 'Star4', 'No_of_Votes', 'Gross'],
dtype='object')
# Display top 10 movies's title, released year and IMDB rating.
imdb[['Series_Title','Released_Year','IMDB_Rating']].head(10)
| Series_Title | Released_Year | IMDB_Rating | |
|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | 9.3 |
| 1 | The Godfather | 1972 | 9.2 |
| 2 | The Dark Knight | 2008 | 9.0 |
| 3 | The Godfather: Part II | 1974 | 9.0 |
| 4 | 12 Angry Men | 1957 | 9.0 |
| 5 | The Lord of the Rings: The Return of the King | 2003 | 8.9 |
| 6 | Pulp Fiction | 1994 | 8.9 |
| 7 | Schindler's List | 1993 | 8.9 |
| 8 | Inception | 2010 | 8.8 |
| 9 | Fight Club | 1999 | 8.8 |
# Display moviess ranked 11-20.
# Show their title, released year and IMDB rating.
imdb.iloc[11:21,[1,2,6]]
| Series_Title | Released_Year | IMDB_Rating | |
|---|---|---|---|
| 11 | Forrest Gump | 1994 | 8.8 |
| 12 | Il buono, il brutto, il cattivo | 1966 | 8.8 |
| 13 | The Lord of the Rings: The Two Towers | 2002 | 8.7 |
| 14 | The Matrix | 1999 | 8.7 |
| 15 | Goodfellas | 1990 | 8.7 |
| 16 | Star Wars: Episode V - The Empire Strikes Back | 1980 | 8.7 |
| 17 | One Flew Over the Cuckoo's Nest | 1975 | 8.7 |
| 18 | Hamilton | 2020 | 8.6 |
| 19 | Gisaengchung | 2019 | 8.6 |
| 20 | Soorarai Pottru | 2020 | 8.6 |
# Select all movies directed by Quentin Tarantino.
# Show their title, released year, IMDB rating, and gross.
imdb.loc[imdb['Director']=='Quentin Tarantino'
,['Series_Title','Released_Year','IMDB_Rating','Gross']]
| Series_Title | Released_Year | IMDB_Rating | Gross | |
|---|---|---|---|---|
| 6 | Pulp Fiction | 1994 | 8.9 | 107,928,762 |
| 62 | Django Unchained | 2012 | 8.4 | 162,805,434 |
| 93 | Inglourious Basterds | 2009 | 8.3 | 120,540,719 |
| 103 | Reservoir Dogs | 1992 | 8.3 | 2,832,029 |
| 241 | Kill Bill: Vol. 1 | 2003 | 8.1 | 70,099,045 |
| 369 | Kill Bill: Vol. 2 | 2004 | 8.0 | 66,208,183 |
| 584 | The Hateful Eight | 2015 | 7.8 | 54,117,416 |
| 879 | Once Upon a Time... in Hollywood | 2019 | 7.6 | 142,502,728 |
# Select all R rated movies with IMDB_Rating>=8.5
# Show their title, released year, certificate, and IMDB rating.
imdb.loc[(imdb['Certificate']=='R')&(imdb['IMDB_Rating']>=8.5)
,['Series_Title','Released_Year','Certificate','IMDB_Rating']]
| Series_Title | Released_Year | Certificate | IMDB_Rating | |
|---|---|---|---|---|
| 24 | Saving Private Ryan | 1998 | R | 8.6 |
| 38 | The Pianist | 2002 | R | 8.5 |
| 40 | American History X | 1998 | R | 8.5 |
# How many unique values are there in the column "Released_Year"?
# Hint: nuniuqe()
imdb['Released_Year'].nunique()
100
# Count the number of movies in each "Released_Year"?
# Hint: value_counts()
imdb['Released_Year'].value_counts()
2014 32
2004 31
2009 29
2013 28
2016 28
..
1926 1
1936 1
1924 1
1921 1
PG 1
Name: Released_Year, Length: 100, dtype: int64
# In this dataset, there is a movie with an error in "Released_Year".
# Hint: Released_Year should be a 4-digit integer but this movie's is not.
# Find this movie.
imdb.Released_Year.unique()
array(['1994', '1972', '2008', '1974', '1957', '2003', '1993', '2010',
'1999', '2001', '1966', '2002', '1990', '1980', '1975', '2020',
'2019', '2014', '1998', '1997', '1995', '1991', '1977', '1962',
'1954', '1946', '2011', '2006', '2000', '1988', '1985', '1968',
'1960', '1942', '1936', '1931', '2018', '2017', '2016', '2012',
'2009', '2007', '1984', '1981', '1979', '1971', '1963', '1964',
'1950', '1940', '2013', '2005', '2004', '1992', '1987', '1986',
'1983', '1976', '1973', '1965', '1959', '1958', '1952', '1948',
'1944', '1941', '1927', '1921', '2015', '1996', '1989', '1978',
'1961', '1955', '1953', '1925', '1924', '1982', '1967', '1951',
'1949', '1939', '1937', '1934', '1928', '1926', '1920', '1970',
'1969', '1956', '1947', '1945', '1930', '1938', '1935', '1933',
'1932', '1922', '1943', 'PG'], dtype=object)
# Correct the values for the corresponding columns ("Release_Year" and "Certificate").
# You may want to look up this movie on www.imdb.com.
# Hint: You can set value for a particular set by: df.loc[row_name, column_name] = new_value
imdb.loc[966,'Released_Year']=1995
imdb.loc[966,'Certificate']='PG'
imdb.iloc[966,:]
Poster_Link https://m.media-amazon.com/images/M/MV5BNjEzYj... Series_Title Apollo 13 Released_Year 1995 Certificate PG Runtime 140 min Genre Adventure, Drama, History IMDB_Rating 7.6 Overview NASA must devise a strategy to return Apollo 1... Meta_score 77.0 Director Ron Howard Star1 Tom Hanks Star2 Bill Paxton Star3 Kevin Bacon Star4 Gary Sinise No_of_Votes 269197 Gross 173,837,933 Name: 966, dtype: object
# Change the data type of "Released_Year" to int
imdb['Released_Year']=imdb['Released_Year'].apply(int)
imdb.dtypes
Poster_Link object Series_Title object Released_Year int64 Certificate object Runtime object Genre object IMDB_Rating float64 Overview object Meta_score float64 Director object Star1 object Star2 object Star3 object Star4 object No_of_Votes int64 Gross object dtype: object
# Select all movies released after (>=) 2010 and with IMDB_Rating>=8.5
# Show their title, released year, certificate, and IMDB rating.
imdb.loc[(imdb['Released_Year']>=2010)&(imdb['IMDB_Rating']>=8.5)
,['Series_Title','Released_Year','Certificate','IMDB_Rating']]
| Series_Title | Released_Year | Certificate | IMDB_Rating | |
|---|---|---|---|---|
| 8 | Inception | 2010 | UA | 8.8 |
| 18 | Hamilton | 2020 | PG-13 | 8.6 |
| 19 | Gisaengchung | 2019 | A | 8.6 |
| 20 | Soorarai Pottru | 2020 | U | 8.6 |
| 21 | Interstellar | 2014 | UA | 8.6 |
| 33 | Joker | 2019 | A | 8.5 |
| 34 | Whiplash | 2014 | A | 8.5 |
| 35 | The Intouchables | 2011 | UA | 8.5 |
# Select all movies whose genres contain 'Animation'
imdb1=imdb.dropna()
imdb1[imdb1['Genre'].str.contains('Animation')] #.Series_Title.count()
| Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | https://m.media-amazon.com/images/M/MV5BMjlmZm... | Sen to Chihiro no kamikakushi | 2001 | U | 125 min | Animation, Adventure, Family | 8.6 | During her family's move to the suburbs, a sul... | 96.0 | Hayao Miyazaki | Daveigh Chase | Suzanne Pleshette | Miyu Irino | Rumi Hiiragi | 651376 | 10,055,859 |
| 43 | https://m.media-amazon.com/images/M/MV5BYTYxNG... | The Lion King | 1994 | U | 88 min | Animation, Adventure, Drama | 8.5 | Lion prince Simba and his father are targeted ... | 88.0 | Roger Allers | Rob Minkoff | Matthew Broderick | Jeremy Irons | James Earl Jones | 942045 | 422,783,777 |
| 56 | https://m.media-amazon.com/images/M/MV5BODRmZD... | Kimi no na wa. | 2016 | U | 106 min | Animation, Drama, Fantasy | 8.4 | Two strangers find themselves linked in a biza... | 79.0 | Makoto Shinkai | Ryûnosuke Kamiki | Mone Kamishiraishi | Ryô Narita | Aoi Yûki | 194838 | 5,017,246 |
| 58 | https://m.media-amazon.com/images/M/MV5BMjMwND... | Spider-Man: Into the Spider-Verse | 2018 | U | 117 min | Animation, Action, Adventure | 8.4 | Teen Miles Morales becomes the Spider-Man of h... | 87.0 | Bob Persichetti | Peter Ramsey | Rodney Rothman | Shameik Moore | Jake Johnson | 375110 | 190,241,310 |
| 61 | https://m.media-amazon.com/images/M/MV5BYjQ5Nj... | Coco | 2017 | U | 105 min | Animation, Adventure, Family | 8.4 | Aspiring musician Miguel, confronted with his ... | 81.0 | Lee Unkrich | Adrian Molina | Anthony Gonzalez | Gael García Bernal | Benjamin Bratt | 384171 | 209,726,015 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 906 | https://m.media-amazon.com/images/M/MV5BMTY3Nj... | Despicable Me | 2010 | U | 95 min | Animation, Comedy, Crime | 7.6 | When a criminal mastermind uses a trio of orph... | 72.0 | Pierre Coffin | Chris Renaud | Steve Carell | Jason Segel | Russell Brand | 500851 | 251,513,985 |
| 956 | https://m.media-amazon.com/images/M/MV5BODkxNG... | Mulan | 1998 | U | 88 min | Animation, Adventure, Family | 7.6 | To save her father from death in the army, a y... | 71.0 | Tony Bancroft | Barry Cook | Ming-Na Wen | Eddie Murphy | BD Wong | 256906 | 120,620,254 |
| 971 | https://m.media-amazon.com/images/M/MV5BMTY5Nj... | Omohide poro poro | 1991 | U | 118 min | Animation, Drama, Romance | 7.6 | A twenty-seven-year-old office worker travels ... | 90.0 | Isao Takahata | Miki Imai | Toshirô Yanagiba | Yoko Honna | Mayumi Izuka | 27071 | 453,243 |
| 976 | https://m.media-amazon.com/images/M/MV5BN2JlZT... | The Little Mermaid | 1989 | U | 83 min | Animation, Family, Fantasy | 7.6 | A mermaid princess makes a Faustian bargain in... | 88.0 | Ron Clements | John Musker | Jodi Benson | Samuel E. Wright | Rene Auberjonois | 237696 | 111,543,479 |
| 992 | https://m.media-amazon.com/images/M/MV5BMjAwMT... | The Jungle Book | 1967 | U | 78 min | Animation, Adventure, Family | 7.6 | Bagheera the Panther and Baloo the Bear have a... | 65.0 | Wolfgang Reitherman | Phil Harris | Sebastian Cabot | Louis Prima | Bruce Reitherman | 166409 | 141,843,612 |
63 rows × 16 columns
# Create a new dataframe called "stars" including the following columns:
# Series_Title, Released_Year, Star1, Star2, Star3, Star4
stars = imdb.filter(['Series_Title','Released_Year'
,'Star1','Star2','Star3','Star4'])
stars
| Series_Title | Released_Year | Star1 | Star2 | Star3 | Star4 | |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler |
| 1 | The Godfather | 1972 | Marlon Brando | Al Pacino | James Caan | Diane Keaton |
| 2 | The Dark Knight | 2008 | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine |
| 3 | The Godfather: Part II | 1974 | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton |
| 4 | 12 Angry Men | 1957 | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | Breakfast at Tiffany's | 1961 | Audrey Hepburn | George Peppard | Patricia Neal | Buddy Ebsen |
| 996 | Giant | 1956 | Elizabeth Taylor | Rock Hudson | James Dean | Carroll Baker |
| 997 | From Here to Eternity | 1953 | Burt Lancaster | Montgomery Clift | Deborah Kerr | Donna Reed |
| 998 | Lifeboat | 1944 | Tallulah Bankhead | John Hodiak | Walter Slezak | William Bendix |
| 999 | The 39 Steps | 1935 | Robert Donat | Madeleine Carroll | Lucie Mannheim | Godfrey Tearle |
1000 rows × 6 columns
# Create a new dataframe called "genres" including the following columns:
# Series_Title, Released_Year, Genre.
genres = imdb.filter(['Series_Title','Released_Year','Genre'])
genres
| Series_Title | Released_Year | Genre | |
|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Drama |
| 1 | The Godfather | 1972 | Crime, Drama |
| 2 | The Dark Knight | 2008 | Action, Crime, Drama |
| 3 | The Godfather: Part II | 1974 | Crime, Drama |
| 4 | 12 Angry Men | 1957 | Crime, Drama |
| ... | ... | ... | ... |
| 995 | Breakfast at Tiffany's | 1961 | Comedy, Drama, Romance |
| 996 | Giant | 1956 | Drama, Western |
| 997 | From Here to Eternity | 1953 | Drama, Romance, War |
| 998 | Lifeboat | 1944 | Drama, War |
| 999 | The 39 Steps | 1935 | Crime, Mystery, Thriller |
1000 rows × 3 columns
# Sorting:
# Sort dataframe genres in ascending order of "Released_Year"
genres.sort_values('Released_Year')
| Series_Title | Released_Year | Genre | |
|---|---|---|---|
| 321 | Das Cabinet des Dr. Caligari | 1920 | Fantasy, Horror, Mystery |
| 127 | The Kid | 1921 | Comedy, Drama, Family |
| 568 | Nosferatu | 1922 | Fantasy, Horror |
| 194 | Sherlock Jr. | 1924 | Action, Comedy, Romance |
| 193 | The Gold Rush | 1925 | Adventure, Comedy, Drama |
| ... | ... | ... | ... |
| 20 | Soorarai Pottru | 2020 | Drama |
| 205 | Soul | 2020 | Animation, Adventure, Comedy |
| 613 | Druk | 2020 | Comedy, Drama |
| 464 | Dil Bechara | 2020 | Comedy, Drama, Romance |
| 612 | The Trial of the Chicago 7 | 2020 | Drama, History, Thriller |
1000 rows × 3 columns
# Select all movies released after (>=) 2010 and with IMDB_Rating>=8.5
# Show their title, released year, Certificate, and gross.
# Sort them in descending order of "Gross"
imdb.loc[(imdb['Released_Year']>=2010)&(imdb['IMDB_Rating']>=8.5)
,['Series_Title','Released_Year','Certificate','Gross']]
imdb.sort_values('Gross',ascending=False)
| Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 738 | https://m.media-amazon.com/images/M/MV5BOTc3Nz... | Rockstar | 2011 | UA | 159 min | Drama, Music, Musical | 7.7 | Janardhan Jakhar chases his dreams of becoming... | NaN | Imtiaz Ali | Ranbir Kapoor | Nargis Fakhri | Shammi Kapoor | Kumud Mishra | 39501 | 985,912 |
| 682 | https://m.media-amazon.com/images/M/MV5BZDRkOW... | The Color Purple | 1985 | U | 154 min | Drama | 7.8 | A black Southern woman struggles to find her i... | 78.0 | Steven Spielberg | Danny Glover | Whoopi Goldberg | Oprah Winfrey | Margaret Avery | 78321 | 98,467,863 |
| 194 | https://m.media-amazon.com/images/M/MV5BZWFhOG... | Sherlock Jr. | 1924 | Passed | 45 min | Action, Comedy, Romance | 8.2 | A film projectionist longs to be a detective, ... | NaN | Buster Keaton | Buster Keaton | Kathryn McGuire | Joe Keaton | Erwin Connelly | 41985 | 977,375 |
| 748 | https://m.media-amazon.com/images/M/MV5BOGUyZD... | The Social Network | 2010 | UA | 120 min | Biography, Drama | 7.7 | As Harvard student Mark Zuckerberg creates the... | 95.0 | David Fincher | Jesse Eisenberg | Andrew Garfield | Justin Timberlake | Rooney Mara | 624982 | 96,962,694 |
| 7 | https://m.media-amazon.com/images/M/MV5BNDE4OT... | Schindler's List | 1993 | A | 195 min | Biography, Drama, History | 8.9 | In German-occupied Poland during World War II,... | 94.0 | Steven Spielberg | Liam Neeson | Ralph Fiennes | Ben Kingsley | Caroline Goodall | 1213505 | 96,898,818 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 993 | https://m.media-amazon.com/images/M/MV5BYTE4YW... | Blowup | 1966 | A | 111 min | Drama, Mystery, Thriller | 7.6 | A fashion photographer unknowingly captures a ... | 82.0 | Michelangelo Antonioni | David Hemmings | Vanessa Redgrave | Sarah Miles | John Castle | 56513 | NaN |
| 995 | https://m.media-amazon.com/images/M/MV5BNGEwMT... | Breakfast at Tiffany's | 1961 | A | 115 min | Comedy, Drama, Romance | 7.6 | A young New York socialite becomes interested ... | 76.0 | Blake Edwards | Audrey Hepburn | George Peppard | Patricia Neal | Buddy Ebsen | 166544 | NaN |
| 996 | https://m.media-amazon.com/images/M/MV5BODk3Yj... | Giant | 1956 | G | 201 min | Drama, Western | 7.6 | Sprawling epic covering the life of a Texas ca... | 84.0 | George Stevens | Elizabeth Taylor | Rock Hudson | James Dean | Carroll Baker | 34075 | NaN |
| 998 | https://m.media-amazon.com/images/M/MV5BZTBmMj... | Lifeboat | 1944 | NaN | 97 min | Drama, War | 7.6 | Several survivors of a torpedoed merchant ship... | 78.0 | Alfred Hitchcock | Tallulah Bankhead | John Hodiak | Walter Slezak | William Bendix | 26471 | NaN |
| 999 | https://m.media-amazon.com/images/M/MV5BMTY5OD... | The 39 Steps | 1935 | NaN | 86 min | Crime, Mystery, Thriller | 7.6 | A man in London tries to help a counter-espion... | 93.0 | Alfred Hitchcock | Robert Donat | Madeleine Carroll | Lucie Mannheim | Godfrey Tearle | 51853 | NaN |
1000 rows × 16 columns
# Does the sorting result looks right to you? What's the problem?
# Numbers order are not sorting, because it is str type
# Resolve this problem of "Gross" and convert its data type to float
# Hint: You may find this webpage useful:
# https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-column-of-a-pandas-dataframe
imdb['Gross']=imdb['Gross'].str.replace(',','')
imdb['Gross']=imdb['Gross'].apply(float)
imdb.dtypes
Poster_Link object Series_Title object Released_Year int64 Certificate object Runtime object Genre object IMDB_Rating float64 Overview object Meta_score float64 Director object Star1 object Star2 object Star3 object Star4 object No_of_Votes int64 Gross float64 dtype: object
# Next, redo the sorting on Gross
# Select all movies released after (>=) 2010 and with IMDB_Rating>=8.5
# Show their title, released year, Certificate, and gross.
# Sort them in descending order of "Gross"
imdb.loc[(imdb['Released_Year']>=2010)&(imdb['IMDB_Rating']>=8.5)
,['Series_Title','Released_Year','Certificate','Gross']]
imdb.sort_values('Gross',ascending=False)
| Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 477 | https://m.media-amazon.com/images/M/MV5BOTAzOD... | Star Wars: Episode VII - The Force Awakens | 2015 | U | 138 min | Action, Adventure, Sci-Fi | 7.9 | As a new threat to the galaxy rises, Rey, a de... | 80.0 | J.J. Abrams | Daisy Ridley | John Boyega | Oscar Isaac | Domhnall Gleeson | 860823 | 936662225.0 |
| 59 | https://m.media-amazon.com/images/M/MV5BMTc5MD... | Avengers: Endgame | 2019 | UA | 181 min | Action, Adventure, Drama | 8.4 | After the devastating events of Avengers: Infi... | 78.0 | Anthony Russo | Joe Russo | Robert Downey Jr. | Chris Evans | Mark Ruffalo | 809955 | 858373000.0 |
| 623 | https://m.media-amazon.com/images/M/MV5BMTYwOT... | Avatar | 2009 | UA | 162 min | Action, Adventure, Fantasy | 7.8 | A paraplegic Marine dispatched to the moon Pan... | 83.0 | James Cameron | Sam Worthington | Zoe Saldana | Sigourney Weaver | Michelle Rodriguez | 1118998 | 760507625.0 |
| 60 | https://m.media-amazon.com/images/M/MV5BMjMxNj... | Avengers: Infinity War | 2018 | UA | 149 min | Action, Adventure, Sci-Fi | 8.4 | The Avengers and their allies must be willing ... | 68.0 | Anthony Russo | Joe Russo | Robert Downey Jr. | Chris Hemsworth | Mark Ruffalo | 834477 | 678815482.0 |
| 652 | https://m.media-amazon.com/images/M/MV5BMDdmZG... | Titanic | 1997 | UA | 194 min | Drama, Romance | 7.8 | A seventeen-year-old aristocrat falls in love ... | 75.0 | James Cameron | Leonardo DiCaprio | Kate Winslet | Billy Zane | Kathy Bates | 1046089 | 659325379.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 993 | https://m.media-amazon.com/images/M/MV5BYTE4YW... | Blowup | 1966 | A | 111 min | Drama, Mystery, Thriller | 7.6 | A fashion photographer unknowingly captures a ... | 82.0 | Michelangelo Antonioni | David Hemmings | Vanessa Redgrave | Sarah Miles | John Castle | 56513 | NaN |
| 995 | https://m.media-amazon.com/images/M/MV5BNGEwMT... | Breakfast at Tiffany's | 1961 | A | 115 min | Comedy, Drama, Romance | 7.6 | A young New York socialite becomes interested ... | 76.0 | Blake Edwards | Audrey Hepburn | George Peppard | Patricia Neal | Buddy Ebsen | 166544 | NaN |
| 996 | https://m.media-amazon.com/images/M/MV5BODk3Yj... | Giant | 1956 | G | 201 min | Drama, Western | 7.6 | Sprawling epic covering the life of a Texas ca... | 84.0 | George Stevens | Elizabeth Taylor | Rock Hudson | James Dean | Carroll Baker | 34075 | NaN |
| 998 | https://m.media-amazon.com/images/M/MV5BZTBmMj... | Lifeboat | 1944 | NaN | 97 min | Drama, War | 7.6 | Several survivors of a torpedoed merchant ship... | 78.0 | Alfred Hitchcock | Tallulah Bankhead | John Hodiak | Walter Slezak | William Bendix | 26471 | NaN |
| 999 | https://m.media-amazon.com/images/M/MV5BMTY5OD... | The 39 Steps | 1935 | NaN | 86 min | Crime, Mystery, Thriller | 7.6 | A man in London tries to help a counter-espion... | 93.0 | Alfred Hitchcock | Robert Donat | Madeleine Carroll | Lucie Mannheim | Godfrey Tearle | 51853 | NaN |
1000 rows × 16 columns
# Add a new column "Runtime_min" by removing the substring ' min" in "Runtime"
# Set its data type as int
# Hint: https://stackoverflow.com/questions/36505847/substring-of-an-entire-column-in-pandas-dataframe
imdb['Runtime_min']=imdb['Runtime'].str.replace('min','')
imdb['Runtime_min']=imdb['Runtime_min'].apply(int)
imdb.dtypes
Poster_Link object Series_Title object Released_Year int64 Certificate object Runtime object Genre object IMDB_Rating float64 Overview object Meta_score float64 Director object Star1 object Star2 object Star3 object Star4 object No_of_Votes int64 Gross float64 Runtime_min int64 dtype: object
# Add a new column "Age_Year" by expression: 2021 - Released_Year
imdb['Age_Year'] = 2021 - imdb.Released_Year
imdb.head(3)
| Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross | Runtime_min | Age_Year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469.0 | 142 | 27 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411.0 | 175 | 49 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444.0 | 152 | 13 |
# Add a new column "Decade" with values as 1980, 1990, 2000, 2010, 2020, etc.
imdb['Decade']=imdb.Released_Year//10*10
imdb.head(10)
| Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross | Runtime_min | Age_Year | Decade | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469.0 | 142 | 27 | 1990 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411.0 | 175 | 49 | 1970 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444.0 | 152 | 13 | 2000 |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57300000.0 | 202 | 47 | 1970 |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4360000.0 | 96 | 64 | 1950 |
| 5 | https://m.media-amazon.com/images/M/MV5BNzA5ZD... | The Lord of the Rings: The Return of the King | 2003 | U | 201 min | Action, Adventure, Drama | 8.9 | Gandalf and Aragorn lead the World of Men agai... | 94.0 | Peter Jackson | Elijah Wood | Viggo Mortensen | Ian McKellen | Orlando Bloom | 1642758 | 377845905.0 | 201 | 18 | 2000 |
| 6 | https://m.media-amazon.com/images/M/MV5BNGNhMD... | Pulp Fiction | 1994 | A | 154 min | Crime, Drama | 8.9 | The lives of two mob hitmen, a boxer, a gangst... | 94.0 | Quentin Tarantino | John Travolta | Uma Thurman | Samuel L. Jackson | Bruce Willis | 1826188 | 107928762.0 | 154 | 27 | 1990 |
| 7 | https://m.media-amazon.com/images/M/MV5BNDE4OT... | Schindler's List | 1993 | A | 195 min | Biography, Drama, History | 8.9 | In German-occupied Poland during World War II,... | 94.0 | Steven Spielberg | Liam Neeson | Ralph Fiennes | Ben Kingsley | Caroline Goodall | 1213505 | 96898818.0 | 195 | 28 | 1990 |
| 8 | https://m.media-amazon.com/images/M/MV5BMjAxMz... | Inception | 2010 | UA | 148 min | Action, Adventure, Sci-Fi | 8.8 | A thief who steals corporate secrets through t... | 74.0 | Christopher Nolan | Leonardo DiCaprio | Joseph Gordon-Levitt | Elliot Page | Ken Watanabe | 2067042 | 292576195.0 | 148 | 11 | 2010 |
| 9 | https://m.media-amazon.com/images/M/MV5BMmEzNT... | Fight Club | 1999 | A | 139 min | Drama | 8.8 | An insomniac office worker and a devil-may-car... | 66.0 | David Fincher | Brad Pitt | Edward Norton | Meat Loaf | Zach Grenier | 1854740 | 37030102.0 | 139 | 22 | 1990 |
# Total "Gross" of all top 1000 movies
#(
# imdb
# .groupby('Series_Title')
# .agg({'Gross':'sum'})
# .reset_index()
#)
imdb.Gross.sum()
56536877976.0
# Average "No_of_Votes" of all movies
#(
# imdb
# .groupby('Series_Title')
# .agg({'No_of_Votes':'mean'})
# .reset_index()
#)
imdb.No_of_Votes.mean()
273692.911
# Count movies in each decade (e.g., ..., 1980, 1990, 2000, 2010, 2020)
# Sort decades by the number of movies in descending order
(
imdb
.groupby('Decade')
.agg({'Series_Title':'count'})
.reset_index()
.sort_values('Series_Title', ascending=False)
)
| Decade | Series_Title | |
|---|---|---|
| 9 | 2010 | 242 |
| 8 | 2000 | 237 |
| 7 | 1990 | 151 |
| 6 | 1980 | 89 |
| 5 | 1970 | 76 |
| 4 | 1960 | 73 |
| 3 | 1950 | 56 |
| 2 | 1940 | 35 |
| 1 | 1930 | 24 |
| 0 | 1920 | 11 |
| 10 | 2020 | 6 |
# Count movies by different directors.
# Show the top 10 directors with the most movies in this list
(
imdb
.groupby('Director')
.agg({'Series_Title':'count'})
.reset_index()
.sort_values('Series_Title',ascending=False)
.head(10)
)
| Director | Series_Title | |
|---|---|---|
| 22 | Alfred Hitchcock | 14 |
| 470 | Steven Spielberg | 13 |
| 179 | Hayao Miyazaki | 11 |
| 313 | Martin Scorsese | 10 |
| 9 | Akira Kurosawa | 10 |
| 463 | Stanley Kubrick | 9 |
| 532 | Woody Allen | 9 |
| 49 | Billy Wilder | 9 |
| 391 | Quentin Tarantino | 8 |
| 83 | Christopher Nolan | 8 |
# For each director, calculate the number of movies, average IMDB_Rating, and total gross.
# Ranked in descending order of total gross
# Show the top 10 directors with the most gross
(
imdb
.groupby('Director')
.agg({'Series_Title':'count'
,'IMDB_Rating':'mean'
,'Gross':'sum'
})
.reset_index()
.sort_values('Gross', ascending=False)
.head(10)
)
| Director | Series_Title | IMDB_Rating | Gross | |
|---|---|---|---|---|
| 470 | Steven Spielberg | 13 | 8.030769 | 2.478133e+09 |
| 36 | Anthony Russo | 4 | 8.075000 | 2.205039e+09 |
| 83 | Christopher Nolan | 8 | 8.462500 | 1.937454e+09 |
| 202 | James Cameron | 5 | 8.080000 | 1.748237e+09 |
| 383 | Peter Jackson | 5 | 8.400000 | 1.597312e+09 |
| 195 | J.J. Abrams | 3 | 7.833333 | 1.423171e+09 |
| 58 | Brad Bird | 4 | 7.900000 | 1.099628e+09 |
| 426 | Robert Zemeckis | 5 | 8.120000 | 1.049446e+09 |
| 107 | David Yates | 3 | 7.800000 | 9.789537e+08 |
| 380 | Pete Docter | 4 | 8.125000 | 9.393821e+08 |
# Group movies by decade and director
# In each group (i.e., for each decade and each director),
# calculate the number of movies and average IMDB rating.
# Sort in descending order of movie count
(
imdb
.groupby(['Director','Decade'])
.agg({'Series_Title':'count'
,'IMDB_Rating':'mean'
})
.reset_index()
.sort_values('Series_Title', ascending=False)
.head(10)
)
| Director | Decade | Series_Title | IMDB_Rating | |
|---|---|---|---|---|
| 71 | Billy Wilder | 1950 | 6 | 8.133333 |
| 9 | Akira Kurosawa | 1950 | 5 | 8.260000 |
| 31 | Alfred Hitchcock | 1940 | 5 | 7.880000 |
| 120 | Clint Eastwood | 2000 | 5 | 7.940000 |
| 155 | Denis Villeneuve | 2010 | 5 | 7.980000 |
| 32 | Alfred Hitchcock | 1950 | 5 | 8.220000 |
| 645 | Steven Spielberg | 1980 | 5 | 7.980000 |
| 116 | Christopher Nolan | 2000 | 4 | 8.525000 |
| 611 | Sergio Leone | 1960 | 4 | 8.400000 |
| 729 | Woody Allen | 1980 | 4 | 7.800000 |
# Bonus Question 1:
# Find the top 3 highest rated movie of each year since 2010.
# Show their released year, ranking, title, and IMDB_Rating.
# Sort them in descending order of year and ascending order of ranking
# Bonus Question 2:
# Find all directors whose movies appeared in at least five different decades.
# Your result should include: director, decade, and the number of movies in the decade.
You need to complete all data processing/manipulation steps above before visualization.
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn')
import seaborn as sns
# Create a scatterplot to show the two scores "IMDB_Rating" and "Meta_score"
# What can you tell about this pair of scores?
imdb.plot.scatter( x='IMDB_Rating', y='Meta_score')
<AxesSubplot:xlabel='IMDB_Rating', ylabel='Meta_score'>
# Fit a trendline to show the relationship between the two scores
# Hint: sns.regplot()
# Try different order for the trendline
sns.regplot(data=imdb, x='IMDB_Rating', y='Meta_score', order=2)
<AxesSubplot:xlabel='IMDB_Rating', ylabel='Meta_score'>
# Do any of the data points in the scatterplot surprise you?
# Try to identify a couple of such movies.
imdb.query('IMDB_Rating<=8 & Meta_score<35')
| Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross | Runtime_min | Age_Year | Decade | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 356 | https://m.media-amazon.com/images/M/MV5BYmI1OD... | Tropa de Elite | 2007 | R | 115 min | Action, Crime, Drama | 8.0 | In 1997 Rio de Janeiro, Captain Nascimento has... | 33.0 | José Padilha | Wagner Moura | André Ramiro | Caio Junqueira | Milhem Cortaz | 98097 | 8060.0 | 115 | 14 | 2000 |
| 788 | https://m.media-amazon.com/images/M/MV5BYzEyNz... | I Am Sam | 2001 | PG-13 | 132 min | Drama | 7.7 | A mentally handicapped man fights for custody ... | 28.0 | Jessie Nelson | Sean Penn | Michelle Pfeiffer | Dakota Fanning | Dianne Wiest | 142863 | 40311852.0 | 132 | 20 | 2000 |
| 942 | https://m.media-amazon.com/images/M/MV5BODNiZm... | The Butterfly Effect | 2004 | U | 113 min | Drama, Sci-Fi, Thriller | 7.6 | Evan Treborn suffers blackouts during signific... | 30.0 | Eric Bress | J. Mackye Gruber | Ashton Kutcher | Amy Smart | Melora Walters | 451479 | 57938693.0 | 113 | 17 | 2000 |
# In the scatterplot, use color to differentiate movies from different decade.
sns.scatterplot(data=imdb, x='IMDB_Rating', y='Meta_score', hue='Decade')
<AxesSubplot:xlabel='IMDB_Rating', ylabel='Meta_score'>
# Create a chart to show the number of movies in each decade
order = imdb['Decade'].value_counts(ascending=True).index
sns.countplot(data=imdb, x='Decade',order=order)
plt.show()
# Create a chart to show the percentage of movies in each decade
sns.histplot(data=imdb, x='Decade', stat='probability')
plt.show()
# Count movies by different director.
# Show the top 10 directors with the most movies in a bar chart.
imdb.Director.value_counts(ascending=False).head(10).plot(kind='bar')
<AxesSubplot:>
# Create a scatterplot of "IMDB_Rating" and "Gross"
# Use color to differentiate movies from different decades
# What can you tell from the chart?
sns.scatterplot(data=imdb, x='IMDB_Rating', y='Gross', hue='Decade')
<AxesSubplot:xlabel='IMDB_Rating', ylabel='Gross'>
# Create a column (variable) called "Drama" to indicate if a movie's genres contain "Drama"
# Create a pie chart to show the composition
imdb['Drama'] = imdb.Genre.str.contains('Drama')
imdb.Drama.value_counts().plot(kind='pie')
<AxesSubplot:ylabel='Drama'>
# Create a plot to compare the gross of movies across decades
sns.boxplot(data=imdb, x='Decade', y='Gross')
<AxesSubplot:xlabel='Decade', ylabel='Gross'>
# For movies that gross over $100 million dollors
# Create a histogram of gross for drama vs. non-drama movies
#Filter first
imdb.query('Gross>100000000').Drama.value_counts().plot(kind='pie')
<AxesSubplot:ylabel='Drama'>
In any project of data analytics, we start with an important task called exploratory data analysis (EDA), i.e., using transformation, summarization, and visualisation to explore our data in a systematic way, often in an iterative cycle.
In EDA, you:
EDA is more than a formal process with a strict set of rules. Instead, think EDA as a state of mind. Initially, feel free to explore different ideas. Some ideas may be dead ends while some may lead to key insights.
EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large amount of questions. Answering these questions can help you know what insights are contained in your dataset, expose you to new aspects, and increase your chance of making a discovery.
There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn')
import seaborn as sns
diamonds = sns.load_dataset('diamonds')
diamonds.head()
| carat | cut | color | clarity | depth | table | price | x | y | z | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
diamonds.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 53940 entries, 0 to 53939 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 carat 53940 non-null float64 1 cut 53940 non-null category 2 color 53940 non-null category 3 clarity 53940 non-null category 4 depth 53940 non-null float64 5 table 53940 non-null float64 6 price 53940 non-null int64 7 x 53940 non-null float64 8 y 53940 non-null float64 9 z 53940 non-null float64 dtypes: category(3), float64(6), int64(1) memory usage: 3.0 MB
diamonds.describe(include='all')
| carat | cut | color | clarity | depth | table | price | x | y | z | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 53940.000000 | 53940 | 53940 | 53940 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 |
| unique | NaN | 5 | 7 | 8 | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | Ideal | G | SI1 | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | 21551 | 11292 | 13065 | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 0.797940 | NaN | NaN | NaN | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
| std | 0.474011 | NaN | NaN | NaN | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
| min | 0.200000 | NaN | NaN | NaN | 43.000000 | 43.000000 | 326.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.400000 | NaN | NaN | NaN | 61.000000 | 56.000000 | 950.000000 | 4.710000 | 4.720000 | 2.910000 |
| 50% | 0.700000 | NaN | NaN | NaN | 61.800000 | 57.000000 | 2401.000000 | 5.700000 | 5.710000 | 3.530000 |
| 75% | 1.040000 | NaN | NaN | NaN | 62.500000 | 59.000000 | 5324.250000 | 6.540000 | 6.540000 | 4.040000 |
| max | 5.010000 | NaN | NaN | NaN | 79.000000 | 95.000000 | 18823.000000 | 10.740000 | 58.900000 | 31.800000 |
Variation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life. The best way to understand that pattern is to visualize the distribution of the variable’s values.
How you visualise the distribution of a variable will depend on whether the variable is categorical or continuous.
diamonds.cut.value_counts()
Ideal 21551 Premium 13791 Very Good 12082 Good 4906 Fair 1610 Name: cut, dtype: int64
diamonds.cut.value_counts().plot(kind='bar')
<AxesSubplot:>
sns.countplot(data=diamonds, x='cut')
<AxesSubplot:xlabel='cut', ylabel='count'>
# histograms for all numerical features
diamonds.hist(figsize=(15,15))
array([[<AxesSubplot:title={'center':'carat'}>,
<AxesSubplot:title={'center':'depth'}>,
<AxesSubplot:title={'center':'table'}>],
[<AxesSubplot:title={'center':'price'}>,
<AxesSubplot:title={'center':'x'}>,
<AxesSubplot:title={'center':'y'}>],
[<AxesSubplot:title={'center':'z'}>, <AxesSubplot:>,
<AxesSubplot:>]], dtype=object)
diamonds['carat'].hist()
<AxesSubplot:>
# Specify the number of bins (default=10)
diamonds['carat'].hist(bins=20)
<AxesSubplot:>
# Specify the bin width
binwidth = 0.25
diamonds['carat'].hist(bins=np.arange(0, diamonds['carat'].max()+binwidth, binwidth))
<AxesSubplot:>
# To zoom in on diamonds of carat<=3
diamonds[diamonds.carat<=3]['carat'].hist(bins=20)
<AxesSubplot:>
The Seaborn package has a variety of methods for visualizaing distributions of data:
sns.histplot(diamonds, x='carat')
<AxesSubplot:xlabel='carat', ylabel='Count'>
sns.histplot(diamonds, x='carat', stat='probability')
<AxesSubplot:xlabel='carat', ylabel='Probability'>
In both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values. Places that do not have bars reveal values that were not seen in your data. To turn this information into useful questions, look for anything unexpected:
Outliers are observations that are unusual; data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science. When you have a lot of data, outliers are sometimes difficult to see in a histogram. For example, take the distribution of the y variable from the diamonds dataset.
diamonds['y'].hist(bins=50)
# Look how wide the limits on the x-axis are.
<AxesSubplot:>
# Let's zoom in find where the outliers are.
ax = diamonds['y'].hist(bins=50)
ax.set_ylim(ymin=0, ymax = 20)
(0.0, 20.0)
# The chart above allows us to see that there are three unusual values: 0, ~30, and ~60.
# Let's pluck them out.
diamonds[(diamonds.y<3)|(diamonds.y>20)][['x','y','z']].sort_values('y')
# The y variable measures one of the three dimensions of these diamonds, in mm.
# Diamonds can’t have a width of 0mm, so these values must be incorrect.
# Measurements of 32mm and 59mm are implausible. Over an inch long!!!
| x | y | z | |
|---|---|---|---|
| 11963 | 0.00 | 0.0 | 0.00 |
| 15951 | 0.00 | 0.0 | 0.00 |
| 24520 | 0.00 | 0.0 | 0.00 |
| 26243 | 0.00 | 0.0 | 0.00 |
| 27429 | 0.00 | 0.0 | 0.00 |
| 49556 | 0.00 | 0.0 | 0.00 |
| 49557 | 0.00 | 0.0 | 0.00 |
| 49189 | 5.15 | 31.8 | 5.12 |
| 24067 | 8.09 | 58.9 | 8.06 |
# To drop the rows with the unusual values:
diamonds2 = diamonds[(diamonds.y>=3)&(diamonds.y<=20)]
print(f'Row count in the original dataset: {diamonds.shape[0]}')
print(f'Row count in the new dataset: {diamonds2.shape[0]}')
Row count in the original dataset: 53940 Row count in the new dataset: 53931
If variation describes the behavior within a variable, covariation describes the behavior between variables.
Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables.
It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable.
# How the price of a diamond varies with its cut.
sns.histplot(diamonds, x="price", hue="cut",bins=20)
<AxesSubplot:xlabel='price', ylabel='Count'>
# The histogram above is not easy to see the differences between different cuts.
# We can change the argument multiple to make it better.
sns.histplot(diamonds, x="price", hue="cut",bins=20,multiple='stack')
#sns.histplot(diamonds, x="price", hue="cut",bins=20,multiple='dodge')
<AxesSubplot:xlabel='price', ylabel='Count'>
# You can use a displot() as well.
sns.displot(diamonds, x="price", hue="cut",bins=20,multiple='stack')
<seaborn.axisgrid.FacetGrid at 0x1c5972ec340>
# Set stat='probability' to change the scale of y axis
sns.displot(diamonds, x="price", hue="cut",bins=20,multiple='dodge',stat='probability')
<seaborn.axisgrid.FacetGrid at 0x1c59a9fd280>
These histrogams seem not that useful for comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it’s hard to see the differences in shape, e.g., the "fair" cut in this dataset.
# By setting common_norm=False, each subset will be normalized independently:
sns.displot(diamonds, x="price", hue="cut",bins=20,multiple='dodge',stat='probability',common_norm=False)
<seaborn.axisgrid.FacetGrid at 0x1c597c50910>
A histogram aims to approximate the underlying probability density function that generated the data by binning and counting observations. Kernel density estimation (KDE) presents a different solution to the same problem. Rather than using discrete bins, a KDE plot smooths the observations with a Gaussian kernel, producing a continuous density estimate:
# Use a KDE plot to visualize the distribution of a continuous variable
sns.displot(diamonds, x="price", kind="kde")
<seaborn.axisgrid.FacetGrid at 0x1c59af341f0>
# You can use kedplot() as well
sns.kdeplot(data=diamonds, x="price")
<AxesSubplot:xlabel='price', ylabel='Density'>
# Use a KDE plot to visualize the distributions of price for different cuts
sns.displot(diamonds, x="price", hue="cut", kind="kde")
<seaborn.axisgrid.FacetGrid at 0x1c5975df730>
sns.kdeplot(data=diamonds, x="price", hue='cut')
<AxesSubplot:xlabel='price', ylabel='Density'>
Setting the argument common_norm.
# Normalize density for groups: The area below each curve is 1.
sns.kdeplot(data=diamonds, x="price", hue="cut", common_norm=False)
#sns.displot(diamonds, x="price", hue="cut", kind="kde", common_norm=False)
<AxesSubplot:xlabel='price', ylabel='Density'>
There’s something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price!
Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:
# how to display an image
from IPython.display import Image
Image('../img/eda-boxplot.png')
# We can use a boxplot to identify outliers for y.
diamonds['y'].plot.box()
<AxesSubplot:>
# Create a boxplot on price
diamonds['price'].plot.box()
#diamonds['price'].plot(kind='box')
<AxesSubplot:>
# Distribution of price by cut
sns.boxplot(data=diamonds, x='cut', y='price')
<AxesSubplot:xlabel='cut', ylabel='price'>
Compared to histograms, we see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counterintuitive finding that better quality diamonds are cheaper on average!
my_order = diamonds.groupby('cut')['price'].median().sort_values().index
ax = sns.boxplot(data=diamonds, x='cut', y='price', order=my_order)
ax.set_yscale("log") # Transform the y scale to log for easier comparison
If you want to visualise the covariation between two continuous variables, draw a scatterplot.
diamonds.plot(x='carat', y='price', kind='scatter')
<AxesSubplot:xlabel='carat', ylabel='price'>
sns.scatterplot(data=diamonds,x='carat',y='price')
<AxesSubplot:xlabel='carat', ylabel='price'>
# Adjust alpha to add transparency
diamonds.plot(x='carat', y='price', kind='scatter', alpha=0.1)
#sns.scatterplot(data=diamonds,x='carat',y='price',alpha=0.1)
<AxesSubplot:xlabel='carat', ylabel='price'>
To visualise the covariation between categorical variables, you’ll need to count the number of observations for each combination.
dcc = diamonds.groupby(['cut','color'])['carat'].agg(['count'])
dcc
| count | ||
|---|---|---|
| cut | color | |
| Ideal | D | 2834 |
| E | 3903 | |
| F | 3826 | |
| G | 4884 | |
| H | 3115 | |
| I | 2093 | |
| J | 896 | |
| Premium | D | 1603 |
| E | 2337 | |
| F | 2331 | |
| G | 2924 | |
| H | 2360 | |
| I | 1428 | |
| J | 808 | |
| Very Good | D | 1513 |
| E | 2400 | |
| F | 2164 | |
| G | 2299 | |
| H | 1824 | |
| I | 1204 | |
| J | 678 | |
| Good | D | 662 |
| E | 933 | |
| F | 909 | |
| G | 871 | |
| H | 702 | |
| I | 522 | |
| J | 307 | |
| Fair | D | 163 |
| E | 224 | |
| F | 312 | |
| G | 314 | |
| H | 303 | |
| I | 175 | |
| J | 119 |
# Simply using a scatterplot may not show the pattern.
sns.scatterplot(data=diamonds, x='cut', y='color')
<AxesSubplot:xlabel='cut', ylabel='color'>
# The size of each circle in the plot displays how many observations occurred at each combination of values.
sns.scatterplot(data=dcc, x='cut', y='color', size='count')
<AxesSubplot:xlabel='cut', ylabel='color'>
# Create pivot table.
dpt = diamonds.pivot_table(values=['carat'], index=['cut'], columns=['color'], aggfunc=np.size)
dpt
| carat | |||||||
|---|---|---|---|---|---|---|---|
| color | D | E | F | G | H | I | J |
| cut | |||||||
| Ideal | 2834 | 3903 | 3826 | 4884 | 3115 | 2093 | 896 |
| Premium | 1603 | 2337 | 2331 | 2924 | 2360 | 1428 | 808 |
| Very Good | 1513 | 2400 | 2164 | 2299 | 1824 | 1204 | 678 |
| Good | 662 | 933 | 909 | 871 | 702 | 522 | 307 |
| Fair | 163 | 224 | 312 | 314 | 303 | 175 | 119 |
# Create a heatmap
sns.heatmap(data=dpt)
<AxesSubplot:xlabel='None-color', ylabel='cut'>
Create an account on kaggle.com and read the overview of the titanic competition at https://www.kaggle.com/c/titanic/overview, do the followings:
# import required packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# set matplotlib inline and use seaborn
%matplotlib inline
plt.style.use('seaborn')
Download the training dataset and rename it to titanic-train.csv and load it using pandas
# download the training dataset "train.csv", rename it "titanic-train.csv" and load it using pandas
df_train = pd.read_csv('../data/titanic_train.csv')
df_train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null object 4 Sex 891 non-null object 5 Age 714 non-null float64 6 SibSp 891 non-null int64 7 Parch 891 non-null int64 8 Ticket 891 non-null object 9 Fare 891 non-null float64 10 Cabin 204 non-null object 11 Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.7+ KB
# 1)
# 891 rows and 12 features
# 2)
# Survived
# 3)
# Yes, In column Age(177) and Cabin(677) and Embarked (2)
# 4)
# PassengerId int64
# Survived int64
# Pclass int64
# Name object
# Sex object
# Age float64
# SibSp int64
# Parch int64
# Ticket object
# Fare float64
# Cabin object
# Embarked object
df_train.head(5)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df_train.isnull().sum()
PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 177 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 2 dtype: int64
#df_train.columns
#df1 = df_train[[]]
#df1
df_train2=df_train.drop(columns = ['Age','Cabin','Embarked'])
df_train2.info() #9
<class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null object 4 Sex 891 non-null object 5 SibSp 891 non-null int64 6 Parch 891 non-null int64 7 Ticket 891 non-null object 8 Fare 891 non-null float64 dtypes: float64(1), int64(5), object(3) memory usage: 62.8+ KB
df_train2.hist(figsize=(10,10))
array([[<AxesSubplot:title={'center':'PassengerId'}>,
<AxesSubplot:title={'center':'Survived'}>],
[<AxesSubplot:title={'center':'Pclass'}>,
<AxesSubplot:title={'center':'SibSp'}>],
[<AxesSubplot:title={'center':'Parch'}>,
<AxesSubplot:title={'center':'Fare'}>]], dtype=object)
df_train2.Survived.value_counts(normalize=True)
0 0.616162 1 0.383838 Name: Survived, dtype: float64
groupby()df_train2.groupby('Sex')['Survived'].value_counts(normalize=True).plot(kind='bar')
<AxesSubplot:xlabel='Sex,Survived'>
df_train2['SibSp'].plot.box()
<AxesSubplot:>
df_train2['Parch'].plot.box()
<AxesSubplot:>
df_train2['Fare'].plot.box()
<AxesSubplot:>
Fare feature to take a closer look at the data points. Hint: use df.index as x and use values of Fare as ydf_train2.reset_index().plot.scatter(x = 0 , y = 'Fare')
# 3
<AxesSubplot:xlabel='index', ylabel='Fare'>
sns.histplot(df_train2, x='Fare', hue='SibSp', bins=20)
<AxesSubplot:xlabel='Fare', ylabel='Count'>
sns.histplot(df_train2, x='Fare', hue='Survived', bins=20)
<AxesSubplot:xlabel='Fare', ylabel='Count'>
In this notebook, you will learn a useful way to organize your data, an organisation called tidy data. Getting your data into this format requires some upfront work, but that work pays off in the long term.
Tidy datasets are all alike, but every messy dataset is messy in its own way.
You can represent the same underlying data in multiple ways. The example below shows the same data organised in four different ways. Each dataset shows the same values of four variables country, year, population, and cases, but each dataset organizes the values in a different way.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
[['Afghanistan',1999,745,19987071],
['Afghanistan',2000,2666,20595360],
['Brazil',1999,37737,172006362],
['Brazil',2000,80488,174504898],
['China',1999,212258,1272915272],
['China',2000,213766,1280428583]],
columns=['country', 'year', 'cases', 'population'])
df1
| country | year | cases | population | |
|---|---|---|---|---|
| 0 | Afghanistan | 1999 | 745 | 19987071 |
| 1 | Afghanistan | 2000 | 2666 | 20595360 |
| 2 | Brazil | 1999 | 37737 | 172006362 |
| 3 | Brazil | 2000 | 80488 | 174504898 |
| 4 | China | 1999 | 212258 | 1272915272 |
| 5 | China | 2000 | 213766 | 1280428583 |
df2 = pd.DataFrame(
[['Afghanistan',1999,'cases',745],
['Afghanistan',1999,'population',19987071],
['Afghanistan',2000,'cases',2666],
['Afghanistan',2000,'population',20595360],
['Brazil',1999,'cases',37737],
['Brazil',1999,'population',172006362],
['Brazil',2000,'cases',80488],
['Brazil',2000,'population',174504898],
['China',1999,'cases',212258],
['China',1999,'population',1272915272],
['China',2000,'cases',213766],
['China',2000,'population',1280428583]],
columns=['country', 'year', 'type', 'count'])
df2
| country | year | type | count | |
|---|---|---|---|---|
| 0 | Afghanistan | 1999 | cases | 745 |
| 1 | Afghanistan | 1999 | population | 19987071 |
| 2 | Afghanistan | 2000 | cases | 2666 |
| 3 | Afghanistan | 2000 | population | 20595360 |
| 4 | Brazil | 1999 | cases | 37737 |
| 5 | Brazil | 1999 | population | 172006362 |
| 6 | Brazil | 2000 | cases | 80488 |
| 7 | Brazil | 2000 | population | 174504898 |
| 8 | China | 1999 | cases | 212258 |
| 9 | China | 1999 | population | 1272915272 |
| 10 | China | 2000 | cases | 213766 |
| 11 | China | 2000 | population | 1280428583 |
df3 = pd.DataFrame(
[['Afghanistan',1999,'745/19987071'],
['Afghanistan',2000,'2666/20595360'],
['Brazil',1999,'37737/172006362'],
['Brazil',2000,'80488/174504898'],
['China',1999,'212258/1272915272'],
['China',2000,'213766/1280428583']],
columns=['country','year','rate'])
df3
| country | year | rate | |
|---|---|---|---|
| 0 | Afghanistan | 1999 | 745/19987071 |
| 1 | Afghanistan | 2000 | 2666/20595360 |
| 2 | Brazil | 1999 | 37737/172006362 |
| 3 | Brazil | 2000 | 80488/174504898 |
| 4 | China | 1999 | 212258/1272915272 |
| 5 | China | 2000 | 213766/1280428583 |
# cases
df4a = pd.DataFrame(
[['Afghanistan',745,2666],
['Brazil',37737,80488],
['China',212258,213766]],
columns=['country','1999','2000'])
df4a
| country | 1999 | 2000 | |
|---|---|---|---|
| 0 | Afghanistan | 745 | 2666 |
| 1 | Brazil | 37737 | 80488 |
| 2 | China | 212258 | 213766 |
# population
df4b = pd.DataFrame(
[['Afghanistan',19987071,20595360],
['Brazil',172006362,174504898],
['China',1272915272,1280428583]],
columns=['country','1999','2000'])
df4b
| country | 1999 | 2000 | |
|---|---|---|---|
| 0 | Afghanistan | 19987071 | 20595360 |
| 1 | Brazil | 172006362 | 174504898 |
| 2 | China | 1272915272 | 1280428583 |
These are all representations of the same underlying data, but they are not equally easy to use. The tidy dataset will be much easier to work with.
Given each of the four dataframes, try to write a script to answer the following questions.
# df1
df1[(df1.country=='China') & (df1.year==2000)][['cases']]
| cases | |
|---|---|
| 5 | 213766 |
# df2
df2[(df2.country=='China')
& (df2.year==2000)
& (df2.type=='cases')][['count']]
| count | |
|---|---|
| 10 | 213766 |
# df3
df3[(df3.country=='China') & (df3.year==2000)].rate.str.split('/').str[0]
# df3[(df3.country=='China') & (df3.year==2000)][['cases']]
5 213766 Name: rate, dtype: object
# df4 (a or b)
df4a[df4a.country=='China'][['2000']]
| 2000 | |
|---|---|
| 2 | 213766 |
# df1
df1[df1.year==1999].cases.sum()
250740
# df2
df2[(df2.year==1999) & (df2.type=='cases')][['count']].sum()
count 250740 dtype: int64
(
df2[(df2.year==1999) & (df2.type=='cases')]
.groupby('year')
.agg({'count':'sum'})
.reset_index()
)
| year | count | |
|---|---|---|
| 0 | 1999 | 250740 |
# df3
df3
| country | year | rate | |
|---|---|---|---|
| 0 | Afghanistan | 1999 | 745/19987071 |
| 1 | Afghanistan | 2000 | 2666/20595360 |
| 2 | Brazil | 1999 | 37737/172006362 |
| 3 | Brazil | 2000 | 80488/174504898 |
| 4 | China | 1999 | 212258/1272915272 |
| 5 | China | 2000 | 213766/1280428583 |
# df4 (a or b)
df4a['1999'].sum()
250740
There are three interrelated rules which make a dataset tidy:
from IPython.display import Image
Image('https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png')
Among the four dataframes above, which one is tidy? df1
Tidy data makes a lot of operations easier.
# Creating a new column
# Compute rate per 10,000
df1['rate'] = df1.cases / df1.population *10000
df1
| country | year | cases | population | rate | |
|---|---|---|---|---|---|
| 0 | Afghanistan | 1999 | 745 | 19987071 | 0.372741 |
| 1 | Afghanistan | 2000 | 2666 | 20595360 | 1.294466 |
| 2 | Brazil | 1999 | 37737 | 172006362 | 2.193930 |
| 3 | Brazil | 2000 | 80488 | 174504898 | 4.612363 |
| 4 | China | 1999 | 212258 | 1272915272 | 1.667495 |
| 5 | China | 2000 | 213766 | 1280428583 | 1.669488 |
# Data summarization
# Compute cases per year
(
df1
.groupby('year')
.agg({'cases':'sum'})
.reset_index()
)
| year | cases | |
|---|---|---|
| 0 | 1999 | 250740 |
| 1 | 2000 | 296920 |
# For each country, calculate the total number of cases and average population.
(
df1
.groupby('country')
.agg({'cases':'sum'
,'population':'mean'})
.reset_index()
)
| country | cases | population | |
|---|---|---|---|
| 0 | Afghanistan | 3411 | 2.029122e+07 |
| 1 | Brazil | 118225 | 1.732556e+08 |
| 2 | China | 426024 | 1.276672e+09 |
Tidy data is good. Unfortunately, in practice, most data that you will encounter will be untidy in one way or another. Hence, for most real analyses, you’ll need to do some tidying:
A common problem is a dataset where some of the column names are not names of variables, but values of a variable.
Take df4a: the column names 1999 and 2000 represent values of the year variable, the values in the 1999 and 2000 columns represent values of the cases variable, and each row represents two observations, not one.
df4a
| country | 1999 | 2000 | |
|---|---|---|---|
| 0 | Afghanistan | 745 | 2666 |
| 1 | Brazil | 37737 | 80488 |
| 2 | China | 212258 | 213766 |
# Use melt() to gather columns into rows
tidy4a = df4a.melt(id_vars=['country'], value_vars=['1999','2000'],
var_name='year', value_name='cases')
tidy4a
| country | year | cases | |
|---|---|---|---|
| 0 | Afghanistan | 1999 | 745 |
| 1 | Brazil | 1999 | 37737 |
| 2 | China | 1999 | 212258 |
| 3 | Afghanistan | 2000 | 2666 |
| 4 | Brazil | 2000 | 80488 |
| 5 | China | 2000 | 213766 |
# Similarly, apply melt() to df4b
tidy4b = df4b.melt(id_vars=['country'], value_vars=['1999','2000'],
var_name='year', value_name='population')
tidy4b
| country | year | population | |
|---|---|---|---|
| 0 | Afghanistan | 1999 | 19987071 |
| 1 | Brazil | 1999 | 172006362 |
| 2 | China | 1999 | 1272915272 |
| 3 | Afghanistan | 2000 | 20595360 |
| 4 | Brazil | 2000 | 174504898 |
| 5 | China | 2000 | 1280428583 |
# Join the two dataframes into one
tidy4 = pd.merge(tidy4a, tidy4b, on=['country','year'])
tidy4
| country | year | cases | population | |
|---|---|---|---|---|
| 0 | Afghanistan | 1999 | 745 | 19987071 |
| 1 | Brazil | 1999 | 37737 | 172006362 |
| 2 | China | 1999 | 212258 | 1272915272 |
| 3 | Afghanistan | 2000 | 2666 | 20595360 |
| 4 | Brazil | 2000 | 80488 | 174504898 |
| 5 | China | 2000 | 213766 | 1280428583 |
In contrast, another problem is that an observation is scattered across multiple rows.
For example, take df2: an observation is a country in a year, but each observation is spread across two rows.
df2
| country | year | type | count | |
|---|---|---|---|---|
| 0 | Afghanistan | 1999 | cases | 745 |
| 1 | Afghanistan | 1999 | population | 19987071 |
| 2 | Afghanistan | 2000 | cases | 2666 |
| 3 | Afghanistan | 2000 | population | 20595360 |
| 4 | Brazil | 1999 | cases | 37737 |
| 5 | Brazil | 1999 | population | 172006362 |
| 6 | Brazil | 2000 | cases | 80488 |
| 7 | Brazil | 2000 | population | 174504898 |
| 8 | China | 1999 | cases | 212258 |
| 9 | China | 1999 | population | 1272915272 |
| 10 | China | 2000 | cases | 213766 |
| 11 | China | 2000 | population | 1280428583 |
df2.pivot(index=['country','year'], columns='type', values='count')
| type | cases | population | |
|---|---|---|---|
| country | year | ||
| Afghanistan | 1999 | 745 | 19987071 |
| 2000 | 2666 | 20595360 | |
| Brazil | 1999 | 37737 | 172006362 |
| 2000 | 80488 | 174504898 | |
| China | 1999 | 212258 | 1272915272 |
| 2000 | 213766 | 1280428583 |
df2.pivot(index=['country','year'], columns='type', values='count').reset_index()
| type | country | year | cases | population |
|---|---|---|---|---|
| 0 | Afghanistan | 1999 | 745 | 19987071 |
| 1 | Afghanistan | 2000 | 2666 | 20595360 |
| 2 | Brazil | 1999 | 37737 | 172006362 |
| 3 | Brazil | 2000 | 80488 | 174504898 |
| 4 | China | 1999 | 212258 | 1272915272 |
| 5 | China | 2000 | 213766 | 1280428583 |
# Alternatively, you can use the pivot_table() method
df2.pivot_table(index=['country','year'], columns='type', values='count').reset_index()
| type | country | year | cases | population |
|---|---|---|---|---|
| 0 | Afghanistan | 1999 | 745 | 19987071 |
| 1 | Afghanistan | 2000 | 2666 | 20595360 |
| 2 | Brazil | 1999 | 37737 | 172006362 |
| 3 | Brazil | 2000 | 80488 | 174504898 |
| 4 | China | 1999 | 212258 | 1272915272 |
| 5 | China | 2000 | 213766 | 1280428583 |
df3
| country | year | rate | |
|---|---|---|---|
| 0 | Afghanistan | 1999 | 745/19987071 |
| 1 | Afghanistan | 2000 | 2666/20595360 |
| 2 | Brazil | 1999 | 37737/172006362 |
| 3 | Brazil | 2000 | 80488/174504898 |
| 4 | China | 1999 | 212258/1272915272 |
| 5 | China | 2000 | 213766/1280428583 |
df3[['cases','population']] = df3['rate'].str.split("/", expand=True).astype(int)
df3
| country | year | rate | cases | population | |
|---|---|---|---|---|---|
| 0 | Afghanistan | 1999 | 745/19987071 | 745 | 19987071 |
| 1 | Afghanistan | 2000 | 2666/20595360 | 2666 | 20595360 |
| 2 | Brazil | 1999 | 37737/172006362 | 37737 | 172006362 |
| 3 | Brazil | 2000 | 80488/174504898 | 80488 | 174504898 |
| 4 | China | 1999 | 212258/1272915272 | 212258 | 1272915272 |
| 5 | China | 2000 | 213766/1280428583 | 213766 | 1280428583 |
df3.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 6 entries, 0 to 5 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 country 6 non-null object 1 year 6 non-null int64 2 rate 6 non-null object 3 cases 6 non-null int32 4 population 6 non-null int32 dtypes: int32(2), int64(1), object(2) memory usage: 320.0+ bytes
df3 = df3[['country','year','cases','population']]
df3
| country | year | cases | population | |
|---|---|---|---|---|
| 0 | Afghanistan | 1999 | 745 | 19987071 |
| 1 | Afghanistan | 2000 | 2666 | 20595360 |
| 2 | Brazil | 1999 | 37737 | 172006362 |
| 3 | Brazil | 2000 | 80488 | 174504898 |
| 4 | China | 1999 | 212258 | 1272915272 |
| 5 | China | 2000 | 213766 | 1280428583 |
#df3.drop(columns = ['rate','population'])
#df3.drop(columns=['rate'], inplace=True)
#df3
Changing the representation of a dataset brings up an important subtlety of missing values. Surprisingly, a value can be missing in one of two possible ways:
stocks = pd.DataFrame(
[[2015,1,1.88],
[2015,2,0.59],
[2015,3,0.35],
[2015,4,None],
[2016,2,0.92],
[2016,3,0.17],
[2016,4,2.66]],
columns=['year','qtr','return'])
stocks
| year | qtr | return | |
|---|---|---|---|
| 0 | 2015 | 1 | 1.88 |
| 1 | 2015 | 2 | 0.59 |
| 2 | 2015 | 3 | 0.35 |
| 3 | 2015 | 4 | NaN |
| 4 | 2016 | 2 | 0.92 |
| 5 | 2016 | 3 | 0.17 |
| 6 | 2016 | 4 | 2.66 |
There are two missing values in this dataset:
An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
# To reveal all missing vaules explicitly
stocks2 = stocks.pivot(index='qtr', columns='year', values='return').reset_index()
stocks2
| year | qtr | 2015 | 2016 |
|---|---|---|---|
| 0 | 1 | 1.88 | NaN |
| 1 | 2 | 0.59 | 0.92 |
| 2 | 3 | 0.35 | 0.17 |
| 3 | 4 | NaN | 2.66 |
# Reshape the dataframe back to a tidy form
stocks2.melt(id_vars='qtr', value_vars=[2015,2016],
var_name='year', value_name='return')
| qtr | year | return | |
|---|---|---|---|
| 0 | 1 | 2015 | 1.88 |
| 1 | 2 | 2015 | 0.59 |
| 2 | 3 | 2015 | 0.35 |
| 3 | 4 | 2015 | NaN |
| 4 | 1 | 2016 | NaN |
| 5 | 2 | 2016 | 0.92 |
| 6 | 3 | 2016 | 0.17 |
| 7 | 4 | 2016 | 2.66 |
The dataset "who.csv" contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report, available at http://www.who.int/tb/country/data/download/en/.
who = pd.read_csv('../data/who.csv', header=0)
who.head()
| country | iso2 | iso3 | year | new_sp_m014 | new_sp_m1524 | new_sp_m2534 | new_sp_m3544 | new_sp_m4554 | new_sp_m5564 | ... | newrel_m4554 | newrel_m5564 | newrel_m65 | newrel_f014 | newrel_f1524 | newrel_f2534 | newrel_f3544 | newrel_f4554 | newrel_f5564 | newrel_f65 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AF | AFG | 1980 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Afghanistan | AF | AFG | 1981 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Afghanistan | AF | AFG | 1982 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | Afghanistan | AF | AFG | 1983 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Afghanistan | AF | AFG | 1984 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 60 columns
who.shape
(7240, 60)
who.columns
Index(['country', 'iso2', 'iso3', 'year', 'new_sp_m014', 'new_sp_m1524',
'new_sp_m2534', 'new_sp_m3544', 'new_sp_m4554', 'new_sp_m5564',
'new_sp_m65', 'new_sp_f014', 'new_sp_f1524', 'new_sp_f2534',
'new_sp_f3544', 'new_sp_f4554', 'new_sp_f5564', 'new_sp_f65',
'new_sn_m014', 'new_sn_m1524', 'new_sn_m2534', 'new_sn_m3544',
'new_sn_m4554', 'new_sn_m5564', 'new_sn_m65', 'new_sn_f014',
'new_sn_f1524', 'new_sn_f2534', 'new_sn_f3544', 'new_sn_f4554',
'new_sn_f5564', 'new_sn_f65', 'new_ep_m014', 'new_ep_m1524',
'new_ep_m2534', 'new_ep_m3544', 'new_ep_m4554', 'new_ep_m5564',
'new_ep_m65', 'new_ep_f014', 'new_ep_f1524', 'new_ep_f2534',
'new_ep_f3544', 'new_ep_f4554', 'new_ep_f5564', 'new_ep_f65',
'newrel_m014', 'newrel_m1524', 'newrel_m2534', 'newrel_m3544',
'newrel_m4554', 'newrel_m5564', 'newrel_m65', 'newrel_f014',
'newrel_f1524', 'newrel_f2534', 'newrel_f3544', 'newrel_f4554',
'newrel_f5564', 'newrel_f65'],
dtype='object')
It is obvious that this dataset is not tidy. So let's tidy it up.
The best place to start is almost always to gather together the columns that are not variables. Let’s have a look at what we’ve got:
country, iso2, and iso3 are three variables that redundantly specify the country.year is clearly also a variable.new_sp_m014, new_ep_m014, new_ep_f014) these are likely to be values, not variables.# If value_vars is not specified, melt() uses all columns that are not set as id_vars.
who1 = who.melt(id_vars=['country','iso2','iso3','year'],
var_name='key', value_name='cases')
who1
| country | iso2 | iso3 | year | key | cases | |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | AF | AFG | 1980 | new_sp_m014 | NaN |
| 1 | Afghanistan | AF | AFG | 1981 | new_sp_m014 | NaN |
| 2 | Afghanistan | AF | AFG | 1982 | new_sp_m014 | NaN |
| 3 | Afghanistan | AF | AFG | 1983 | new_sp_m014 | NaN |
| 4 | Afghanistan | AF | AFG | 1984 | new_sp_m014 | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 405435 | Zimbabwe | ZW | ZWE | 2009 | newrel_f65 | NaN |
| 405436 | Zimbabwe | ZW | ZWE | 2010 | newrel_f65 | NaN |
| 405437 | Zimbabwe | ZW | ZWE | 2011 | newrel_f65 | NaN |
| 405438 | Zimbabwe | ZW | ZWE | 2012 | newrel_f65 | NaN |
| 405439 | Zimbabwe | ZW | ZWE | 2013 | newrel_f65 | 725.0 |
405440 rows × 6 columns
# Drop rows with missing values
who1 = who1.dropna()
who1
| country | iso2 | iso3 | year | key | cases | |
|---|---|---|---|---|---|---|
| 17 | Afghanistan | AF | AFG | 1997 | new_sp_m014 | 0.0 |
| 18 | Afghanistan | AF | AFG | 1998 | new_sp_m014 | 30.0 |
| 19 | Afghanistan | AF | AFG | 1999 | new_sp_m014 | 8.0 |
| 20 | Afghanistan | AF | AFG | 2000 | new_sp_m014 | 52.0 |
| 21 | Afghanistan | AF | AFG | 2001 | new_sp_m014 | 129.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 405269 | Viet Nam | VN | VNM | 2013 | newrel_f65 | 3110.0 |
| 405303 | Wallis and Futuna Islands | WF | WLF | 2013 | newrel_f65 | 2.0 |
| 405371 | Yemen | YE | YEM | 2013 | newrel_f65 | 360.0 |
| 405405 | Zambia | ZM | ZMB | 2013 | newrel_f65 | 669.0 |
| 405439 | Zimbabwe | ZW | ZWE | 2013 | newrel_f65 | 725.0 |
75752 rows × 6 columns
# Let's try to figure out the meaning of the new column: key
# Count the number of rows for each key value
who1['key'].value_counts()
new_sp_m4554 3205 new_sp_m3544 3201 new_sp_m5564 3200 new_sp_m65 3191 new_sp_m1524 3191 new_sp_m2534 3188 new_sp_f4554 3186 new_sp_f2534 3182 new_sp_f3544 3181 new_sp_f65 3179 new_sp_f5564 3177 new_sp_f1524 3176 new_sp_f014 3156 new_sp_m014 3155 new_sn_m014 1044 new_sn_f014 1039 new_ep_m014 1037 new_ep_f014 1031 new_sn_m1524 1029 new_sn_m4554 1026 new_ep_m1524 1025 new_sn_m3544 1024 new_ep_m3544 1023 new_sn_m2534 1021 new_sn_f1524 1021 new_ep_f3544 1020 new_ep_f2534 1020 new_ep_f1524 1020 new_sn_m5564 1020 new_ep_m2534 1019 new_ep_m4554 1019 new_sn_f3544 1019 new_sn_m65 1019 new_sn_f65 1018 new_sn_f4554 1017 new_ep_m65 1017 new_ep_f4554 1016 new_ep_f5564 1016 new_sn_f5564 1016 new_sn_f2534 1015 new_ep_m5564 1014 new_ep_f65 1013 newrel_f014 189 newrel_m014 189 newrel_m5564 184 newrel_f65 184 newrel_m3544 183 newrel_m4554 183 newrel_f1524 183 newrel_m2534 182 newrel_f3544 182 newrel_f4554 182 newrel_f5564 182 newrel_m65 181 newrel_m1524 181 newrel_f2534 181 Name: key, dtype: int64
According to WHO's data dictionary:
new or old cases of TB. In this dataset, each column contains new cases.rel stands for cases of relapseep stands for cases of extrapulmonary TBsn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)sp stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)m) and females (f).014 = 0 – 14 years old1524 = 15 – 24 years old2534 = 25 – 34 years old3544 = 35 – 44 years old4554 = 45 – 54 years old5564 = 55 – 64 years old65 = 65 or olderWe want to separate the column key to multiple columns.
# Since all cases in this dataset are new, not old. We do not need to deal with the first three letters too much.
# The second group should be "type" but the value 'rel' is a bit different from others
# To make things consistent, we add a '_' between 'new' and 'rel'.
who2 = who1.copy()
who2['key'] = who2['key'].str.replace('newrel','new_rel')
who2['key'].unique()
array(['new_sp_m014', 'new_sp_m1524', 'new_sp_m2534', 'new_sp_m3544',
'new_sp_m4554', 'new_sp_m5564', 'new_sp_m65', 'new_sp_f014',
'new_sp_f1524', 'new_sp_f2534', 'new_sp_f3544', 'new_sp_f4554',
'new_sp_f5564', 'new_sp_f65', 'new_sn_m014', 'new_sn_m1524',
'new_sn_m2534', 'new_sn_m3544', 'new_sn_m4554', 'new_sn_m5564',
'new_sn_m65', 'new_sn_f014', 'new_sn_f1524', 'new_sn_f2534',
'new_sn_f3544', 'new_sn_f4554', 'new_sn_f5564', 'new_sn_f65',
'new_ep_m014', 'new_ep_m1524', 'new_ep_m2534', 'new_ep_m3544',
'new_ep_m4554', 'new_ep_m5564', 'new_ep_m65', 'new_ep_f014',
'new_ep_f1524', 'new_ep_f2534', 'new_ep_f3544', 'new_ep_f4554',
'new_ep_f5564', 'new_ep_f65', 'new_rel_m014', 'new_rel_m1524',
'new_rel_m2534', 'new_rel_m3544', 'new_rel_m4554', 'new_rel_m5564',
'new_rel_m65', 'new_rel_f014', 'new_rel_f1524', 'new_rel_f2534',
'new_rel_f3544', 'new_rel_f4554', 'new_rel_f5564', 'new_rel_f65'],
dtype=object)
# Create three new columns by splitting the key column
who2[["new", "type", "sexage"]] = who2['key'].str.split("_", expand=True)
who2
| country | iso2 | iso3 | year | key | cases | new | type | sexage | |
|---|---|---|---|---|---|---|---|---|---|
| 17 | Afghanistan | AF | AFG | 1997 | new_sp_m014 | 0.0 | new | sp | m014 |
| 18 | Afghanistan | AF | AFG | 1998 | new_sp_m014 | 30.0 | new | sp | m014 |
| 19 | Afghanistan | AF | AFG | 1999 | new_sp_m014 | 8.0 | new | sp | m014 |
| 20 | Afghanistan | AF | AFG | 2000 | new_sp_m014 | 52.0 | new | sp | m014 |
| 21 | Afghanistan | AF | AFG | 2001 | new_sp_m014 | 129.0 | new | sp | m014 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 405269 | Viet Nam | VN | VNM | 2013 | new_rel_f65 | 3110.0 | new | rel | f65 |
| 405303 | Wallis and Futuna Islands | WF | WLF | 2013 | new_rel_f65 | 2.0 | new | rel | f65 |
| 405371 | Yemen | YE | YEM | 2013 | new_rel_f65 | 360.0 | new | rel | f65 |
| 405405 | Zambia | ZM | ZMB | 2013 | new_rel_f65 | 669.0 | new | rel | f65 |
| 405439 | Zimbabwe | ZW | ZWE | 2013 | new_rel_f65 | 725.0 | new | rel | f65 |
75752 rows × 9 columns
# Drop the redundant columns
who2.drop(columns=['iso2','iso3','key','new'], inplace=True)
who2
| country | year | cases | type | sexage | |
|---|---|---|---|---|---|
| 17 | Afghanistan | 1997 | 0.0 | sp | m014 |
| 18 | Afghanistan | 1998 | 30.0 | sp | m014 |
| 19 | Afghanistan | 1999 | 8.0 | sp | m014 |
| 20 | Afghanistan | 2000 | 52.0 | sp | m014 |
| 21 | Afghanistan | 2001 | 129.0 | sp | m014 |
| ... | ... | ... | ... | ... | ... |
| 405269 | Viet Nam | 2013 | 3110.0 | rel | f65 |
| 405303 | Wallis and Futuna Islands | 2013 | 2.0 | rel | f65 |
| 405371 | Yemen | 2013 | 360.0 | rel | f65 |
| 405405 | Zambia | 2013 | 669.0 | rel | f65 |
| 405439 | Zimbabwe | 2013 | 725.0 | rel | f65 |
75752 rows × 5 columns
# Separate sexage into sex and age by splitting after the first character
who2['sex'] = who2['sexage'].str[:1]
who2['age'] = who2['sexage'].str[1:]
who2
| country | year | cases | type | sexage | sex | age | |
|---|---|---|---|---|---|---|---|
| 17 | Afghanistan | 1997 | 0.0 | sp | m014 | m | 014 |
| 18 | Afghanistan | 1998 | 30.0 | sp | m014 | m | 014 |
| 19 | Afghanistan | 1999 | 8.0 | sp | m014 | m | 014 |
| 20 | Afghanistan | 2000 | 52.0 | sp | m014 | m | 014 |
| 21 | Afghanistan | 2001 | 129.0 | sp | m014 | m | 014 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 405269 | Viet Nam | 2013 | 3110.0 | rel | f65 | f | 65 |
| 405303 | Wallis and Futuna Islands | 2013 | 2.0 | rel | f65 | f | 65 |
| 405371 | Yemen | 2013 | 360.0 | rel | f65 | f | 65 |
| 405405 | Zambia | 2013 | 669.0 | rel | f65 | f | 65 |
| 405439 | Zimbabwe | 2013 | 725.0 | rel | f65 | f | 65 |
75752 rows × 7 columns
who2.drop(columns='sexage', inplace=True)
who2
| country | year | cases | type | sex | age | |
|---|---|---|---|---|---|---|
| 17 | Afghanistan | 1997 | 0.0 | sp | m | 014 |
| 18 | Afghanistan | 1998 | 30.0 | sp | m | 014 |
| 19 | Afghanistan | 1999 | 8.0 | sp | m | 014 |
| 20 | Afghanistan | 2000 | 52.0 | sp | m | 014 |
| 21 | Afghanistan | 2001 | 129.0 | sp | m | 014 |
| ... | ... | ... | ... | ... | ... | ... |
| 405269 | Viet Nam | 2013 | 3110.0 | rel | f | 65 |
| 405303 | Wallis and Futuna Islands | 2013 | 2.0 | rel | f | 65 |
| 405371 | Yemen | 2013 | 360.0 | rel | f | 65 |
| 405405 | Zambia | 2013 | 669.0 | rel | f | 65 |
| 405439 | Zimbabwe | 2013 | 725.0 | rel | f | 65 |
75752 rows × 6 columns
Now, the dataset is tidy!
who2 = who2.reset_index()
who2
| index | country | year | cases | type | sex | age | |
|---|---|---|---|---|---|---|---|
| 0 | 17 | Afghanistan | 1997 | 0.0 | sp | m | 014 |
| 1 | 18 | Afghanistan | 1998 | 30.0 | sp | m | 014 |
| 2 | 19 | Afghanistan | 1999 | 8.0 | sp | m | 014 |
| 3 | 20 | Afghanistan | 2000 | 52.0 | sp | m | 014 |
| 4 | 21 | Afghanistan | 2001 | 129.0 | sp | m | 014 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 75747 | 405269 | Viet Nam | 2013 | 3110.0 | rel | f | 65 |
| 75748 | 405303 | Wallis and Futuna Islands | 2013 | 2.0 | rel | f | 65 |
| 75749 | 405371 | Yemen | 2013 | 360.0 | rel | f | 65 |
| 75750 | 405405 | Zambia | 2013 | 669.0 | rel | f | 65 |
| 75751 | 405439 | Zimbabwe | 2013 | 725.0 | rel | f | 65 |
75752 rows × 7 columns
who2.drop(columns ='index', inplace = True)
who2
| country | year | cases | type | sex | age | |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1997 | 0.0 | sp | m | 014 |
| 1 | Afghanistan | 1998 | 30.0 | sp | m | 014 |
| 2 | Afghanistan | 1999 | 8.0 | sp | m | 014 |
| 3 | Afghanistan | 2000 | 52.0 | sp | m | 014 |
| 4 | Afghanistan | 2001 | 129.0 | sp | m | 014 |
| ... | ... | ... | ... | ... | ... | ... |
| 75747 | Viet Nam | 2013 | 3110.0 | rel | f | 65 |
| 75748 | Wallis and Futuna Islands | 2013 | 2.0 | rel | f | 65 |
| 75749 | Yemen | 2013 | 360.0 | rel | f | 65 |
| 75750 | Zambia | 2013 | 669.0 | rel | f | 65 |
| 75751 | Zimbabwe | 2013 | 725.0 | rel | f | 65 |
75752 rows × 6 columns
# For each country, what is the total number of cases for male vs female
(
who2
.groupby(['country','sex'])
.agg({'cases':'sum'})
.reset_index()
.pivot(index='country', columns='sex', values='cases')
.reset_index()
)
| sex | country | f | m |
|---|---|---|---|
| 0 | Afghanistan | 93354.0 | 46871.0 |
| 1 | Albania | 1830.0 | 3505.0 |
| 2 | Algeria | 50522.0 | 77597.0 |
| 3 | American Samoa | 21.0 | 20.0 |
| 4 | Andorra | 41.0 | 62.0 |
| ... | ... | ... | ... |
| 213 | Wallis and Futuna Islands | 20.0 | 21.0 |
| 214 | West Bank and Gaza Strip | 107.0 | 197.0 |
| 215 | Yemen | 39901.0 | 44261.0 |
| 216 | Zambia | 111026.0 | 152812.0 |
| 217 | Zimbabwe | 172729.0 | 187686.0 |
218 rows × 3 columns
In data analysis, you often need to combine multiple data sets to answer the questions that you are interested in.
Collectively, multiple related sets (tables) of data are called relational data. In relational (SQL) databases (DBs), each table is called a relation. Two tables (relations) may have a relationship between each other via a PK (primary key) and a FK (foreign key). It is also not uncommon to have more than two tables related to each other.
To work with relational data, we typically need to from three families of operations:
If you have learned relational databases and SQL (Structured Query Language), you should find many of these concepts and operations familiar.
We will use the nycflights13 package to learn about relational data.
import pandas as pd
# Install the "nycflights13" package before you run the following code.
from nycflights13 import flights
from nycflights13 import airlines
from nycflights13 import airports
from nycflights13 import planes
from nycflights13 import weather
flights.head()
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
airlines.head()
| carrier | name | |
|---|---|---|
| 0 | 9E | Endeavor Air Inc. |
| 1 | AA | American Airlines Inc. |
| 2 | AS | Alaska Airlines Inc. |
| 3 | B6 | JetBlue Airways |
| 4 | DL | Delta Air Lines Inc. |
airports.head()
| faa | name | lat | lon | alt | tz | dst | tzone | |
|---|---|---|---|---|---|---|---|---|
| 0 | 04G | Lansdowne Airport | 41.130472 | -80.619583 | 1044 | -5 | A | America/New_York |
| 1 | 06A | Moton Field Municipal Airport | 32.460572 | -85.680028 | 264 | -6 | A | America/Chicago |
| 2 | 06C | Schaumburg Regional | 41.989341 | -88.101243 | 801 | -6 | A | America/Chicago |
| 3 | 06N | Randall Airport | 41.431912 | -74.391561 | 523 | -5 | A | America/New_York |
| 4 | 09J | Jekyll Island Airport | 31.074472 | -81.427778 | 11 | -5 | A | America/New_York |
planes.head()
| tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | N10156 | 2004.0 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NaN | Turbo-fan |
| 1 | N102UW | 1998.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
| 2 | N103US | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
| 3 | N104UW | 1999.0 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NaN | Turbo-fan |
| 4 | N10575 | 2002.0 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NaN | Turbo-fan |
weather.head()
| origin | year | month | day | hour | temp | dewp | humid | wind_dir | wind_speed | wind_gust | precip | pressure | visib | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EWR | 2013 | 1 | 1 | 1 | 39.02 | 26.06 | 59.37 | 270.0 | 10.35702 | NaN | 0.0 | 1012.0 | 10.0 | 2013-01-01T06:00:00Z |
| 1 | EWR | 2013 | 1 | 1 | 2 | 39.02 | 26.96 | 61.63 | 250.0 | 8.05546 | NaN | 0.0 | 1012.3 | 10.0 | 2013-01-01T07:00:00Z |
| 2 | EWR | 2013 | 1 | 1 | 3 | 39.02 | 28.04 | 64.43 | 240.0 | 11.50780 | NaN | 0.0 | 1012.5 | 10.0 | 2013-01-01T08:00:00Z |
| 3 | EWR | 2013 | 1 | 1 | 4 | 39.92 | 28.04 | 62.21 | 250.0 | 12.65858 | NaN | 0.0 | 1012.2 | 10.0 | 2013-01-01T09:00:00Z |
| 4 | EWR | 2013 | 1 | 1 | 5 | 39.02 | 28.04 | 64.43 | 260.0 | 12.65858 | NaN | 0.0 | 1011.9 | 10.0 | 2013-01-01T10:00:00Z |
The relationships between these tables are shown in the following diagram:
from IPython.display import Image
Image('https://d33wubrfki0l68.cloudfront.net/245292d1ea724f6c3fd8a92063dcd7bfb9758d02/5751b/diagrams/relational-nycflights.png')
For nycflights13:
flights connects to plane via a single variable, tailnum.flights connects to airlines through the carrier variable.flights connects to airports in two ways: via the origin and dest variables.flights connects to weather via origin (the location), and year, month, day and hour (the time).The variables used to connect each pair of tables are called keys. A key is a variable (or set of varialbes) that uniquely identifies an observation.
For example, each plane is uniquely identified by its tailnum. In other cases, multiple variables may be needed. For example, to identify an observation in weather you need five variables: year, month, day, hour, and origin.
There are two types of keys:
planes table, tailnum is a primary key because it uniquely identifies each plane in the planes table.flights table, tailnum is a foreign key because it matches each flight to a unique plane.Once you've identified the primary keys in your tables, it is good practice to verify that they do indeed uniquely identify each observation.
# Count the number of rows in the table "planes"
planes.shape[0]
3322
# Count the number of unique values in column "tailnum"
planes.tailnum.nunique()
3322
# Count the occurrence of different values in column "tailnum"
planes.tailnum.value_counts()
N10156 1
N709EV 1
N706JB 1
N706SW 1
N706TW 1
..
N395HA 1
N395SW 1
N396DA 1
N396SW 1
N999DN 1
Name: tailnum, Length: 3322, dtype: int64
planes.describe(include='all')
| tailnum | year | type | manufacturer | model | engines | seats | speed | engine | |
|---|---|---|---|---|---|---|---|---|---|
| count | 3322 | 3252.000000 | 3322 | 3322 | 3322 | 3322.000000 | 3322.000000 | 23.000000 | 3322 |
| unique | 3322 | NaN | 3 | 35 | 127 | NaN | NaN | NaN | 6 |
| top | N10156 | NaN | Fixed wing multi engine | BOEING | 737-7H4 | NaN | NaN | NaN | Turbo-fan |
| freq | 1 | NaN | 3292 | 1630 | 361 | NaN | NaN | NaN | 2750 |
| mean | NaN | 2000.484010 | NaN | NaN | NaN | 1.995184 | 154.316376 | 236.782609 | NaN |
| std | NaN | 7.193425 | NaN | NaN | NaN | 0.117593 | 73.654974 | 149.759794 | NaN |
| min | NaN | 1956.000000 | NaN | NaN | NaN | 1.000000 | 2.000000 | 90.000000 | NaN |
| 25% | NaN | 1997.000000 | NaN | NaN | NaN | 2.000000 | 140.000000 | 107.500000 | NaN |
| 50% | NaN | 2001.000000 | NaN | NaN | NaN | 2.000000 | 149.000000 | 162.000000 | NaN |
| 75% | NaN | 2005.000000 | NaN | NaN | NaN | 2.000000 | 182.000000 | 432.000000 | NaN |
| max | NaN | 2013.000000 | NaN | NaN | NaN | 4.000000 | 450.000000 | 432.000000 | NaN |
# Try to determine the key for flights: ['year','month','day','carrier','flight']?
flights[['year','month','day','carrier','flight']].value_counts()
year month day carrier flight
2013 8 13 UA 236 2
23 UA 236 2
15 UA 236 2
19 UA 207 2
6 15 WN 2269 2
..
5 4 B6 604 1
600 1
553 1
547 1
12 31 YV 3771 1
Length: 336752, dtype: int64
(
flights
.groupby(['year','month','day','carrier','flight'])
.agg({'sched_dep_time':'count'})
.reset_index()
)
| year | month | day | carrier | flight | sched_dep_time | |
|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 9E | 3286 | 1 |
| 1 | 2013 | 1 | 1 | 9E | 3295 | 1 |
| 2 | 2013 | 1 | 1 | 9E | 3320 | 1 |
| 3 | 2013 | 1 | 1 | 9E | 3321 | 1 |
| 4 | 2013 | 1 | 1 | 9E | 3325 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 336747 | 2013 | 12 | 31 | WN | 2868 | 1 |
| 336748 | 2013 | 12 | 31 | WN | 3566 | 1 |
| 336749 | 2013 | 12 | 31 | WN | 3778 | 1 |
| 336750 | 2013 | 12 | 31 | YV | 2885 | 1 |
| 336751 | 2013 | 12 | 31 | YV | 3771 | 1 |
336752 rows × 6 columns
# Adding more variable(s) to make a key?
flights[['year','month','day','carrier','flight','sched_arr_time']].value_counts()
year month day carrier flight sched_dep_time
2013 1 1 9E 3286 1829 1
8 31 DL 947 1900 1
884 1745 1
874 930 1
873 915 1
..
5 4 B6 671 700 1
658 1245 1
649 2015 1
647 1732 1
12 31 YV 3771 1432 1
Length: 336776, dtype: int64
Sometimes a table doesn’t have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it.
If a table lacks a primary key or a non-composite key, it’s sometimes useful to add one, e.g., its row number. That makes it easier to match observations if you’ve done some filtering and want to check back in with the original data. This is called a surrogate key.
# Add a surrogate key to flights.
flights['id'] = flights.index
flights.head()
| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z | 0 |
| 1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z | 1 |
| 2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z | 2 |
| 3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z | 3 |
| 4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z | 4 |
Join two data sets/tables by the PK-FK relationship.
Create two datasets, adf and bdf:
adf = pd.DataFrame(
[['A',1],
['B',2],
['C',3]],
columns=['x1', 'x2'])
adf
| x1 | x2 | |
|---|---|---|
| 0 | A | 1 |
| 1 | B | 2 |
| 2 | C | 3 |
bdf = pd.DataFrame(
[['A',True],
['B',False],
['D',True]],
columns=['x1', 'x3'])
bdf
| x1 | x3 | |
|---|---|---|
| 0 | A | True |
| 1 | B | False |
| 2 | D | True |
The simplest type of join is the inner join. An inner join matches pairs of observations whenever their keys are equal:
pd.merge(adf, bdf, how='inner', on='x1')
| x1 | x2 | x3 | |
|---|---|---|---|
| 0 | A | 1 | True |
| 1 | B | 2 | False |
# You can do this as well.
adf.merge(bdf, how='inner', on='x1')
| x1 | x2 | x3 | |
|---|---|---|---|
| 0 | A | 1 | True |
| 1 | B | 2 | False |
# Left join
pd.merge(adf, bdf, how='left', on='x1')
| x1 | x2 | x3 | |
|---|---|---|---|
| 0 | A | 1 | True |
| 1 | B | 2 | False |
| 2 | C | 3 | NaN |
# Right join
pd.merge(adf, bdf, how='right', on='x1')
| x1 | x2 | x3 | |
|---|---|---|---|
| 0 | A | 1.0 | True |
| 1 | B | 2.0 | False |
| 2 | D | NaN | True |
# Full outer join
pd.merge(adf, bdf, how='outer', on='x1')
| x1 | x2 | x3 | |
|---|---|---|---|
| 0 | A | 1.0 | True |
| 1 | B | 2.0 | False |
| 2 | C | 3.0 | NaN |
| 3 | D | NaN | True |
# All rows in adf that have a match in bdf
adf[adf.x1.isin(bdf.x1)]
| x1 | x2 | |
|---|---|---|
| 0 | A | 1 |
| 1 | B | 2 |
# All rows in adf that do not have a match in bdf
adf[~adf.x1.isin(bdf.x1)]
| x1 | x2 | |
|---|---|---|
| 2 | C | 3 |
In Pandas, the merge() method can also be used for set-like operations, such as union, intersection, and set-difference. All these operations work with a complete row, comparing the values of every variable.
Take these as examples:
xdf = pd.DataFrame(
[['A',1],
['B',2],
['C',3]],
columns=['x1', 'x2'])
xdf
| x1 | x2 | |
|---|---|---|
| 0 | A | 1 |
| 1 | B | 2 |
| 2 | C | 3 |
ydf = pd.DataFrame(
[['B',2],
['C',3],
['D',4]],
columns=['x1', 'x2'])
ydf
| x1 | x2 | |
|---|---|---|
| 0 | B | 2 |
| 1 | C | 3 |
| 2 | D | 4 |
# Union: Rows that appear in either or both xdf and ydf
# pd.merge(xdf, ydf, how='outer')
pd.merge(xdf, ydf, how='outer', on=['x1','x2'])
| x1 | x2 | |
|---|---|---|
| 0 | A | 1 |
| 1 | B | 2 |
| 2 | C | 3 |
| 3 | D | 4 |
# Intersection: Rows that appear in both xdf and ydf
# pd.merge(xdf, ydf, how='inner')
pd.merge(xdf, ydf, how='inner', on=['x1','x2'])
| x1 | x2 | |
|---|---|---|
| 0 | B | 2 |
| 1 | C | 3 |
# Difference: Rows that appear in xdf but not in ydf
pd.merge(xdf, ydf, how='outer', indicator=True).query('_merge == "left_only"').drop(columns=['_merge'])
| x1 | x2 | |
|---|---|---|
| 0 | A | 1 |
pd.merge(xdf, ydf, how='outer', indicator=True).query('_merge == "left_only"')
#.drop(columns=['_merge'])
| x1 | x2 | _merge | |
|---|---|---|---|
| 0 | A | 1 | left_only |
When working with multiple dataframes, we often need to combine them by rows or by columns. This is when we need to use the conact() method.
# xdf and ydf have the same variables (columns)
# Append rows in ydf to xdf
pd.concat([xdf, ydf], axis=0)
| x1 | x2 | |
|---|---|---|
| 0 | A | 1 |
| 1 | B | 2 |
| 2 | C | 3 |
| 0 | B | 2 |
| 1 | C | 3 |
| 2 | D | 4 |
pd.concat([xdf, ydf], axis=0).drop_duplicates()
| x1 | x2 | |
|---|---|---|
| 0 | A | 1 |
| 1 | B | 2 |
| 2 | C | 3 |
| 2 | D | 4 |
pd.merge(xdf, ydf, how='outer')
| x1 | x2 | |
|---|---|---|
| 0 | A | 1 |
| 1 | B | 2 |
| 2 | C | 3 |
| 3 | D | 4 |
# Create a new dataframe zdf
zdf = pd.DataFrame(
[[True, 4],
[False, 5],
[False, 6]],
columns=['x3','x4'])
zdf
| x3 | x4 | |
|---|---|---|
| 0 | True | 4 |
| 1 | False | 5 |
| 2 | False | 6 |
# xdz and zdf contain different variables of the same data instances in the same order.
# Append the colmuns of zdf to xdf
pd.concat([xdf, zdf], axis=1)
| x1 | x2 | x3 | x4 | |
|---|---|---|---|---|
| 0 | A | 1 | True | 4 |
| 1 | B | 2 | False | 5 |
| 2 | C | 3 | False | 6 |
Let's use merge() on our flights data. For these examples, we’ll make it easier to see what’s going on in the examples by creating a narrower dataframes:
# Create a smaller dataframe
flights2 = flights[['year','month', 'day', 'hour', 'origin', 'dest', 'tailnum', 'carrier']]
flights2
| year | month | day | hour | origin | dest | tailnum | carrier | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 5 | EWR | IAH | N14228 | UA |
| 1 | 2013 | 1 | 1 | 5 | LGA | IAH | N24211 | UA |
| 2 | 2013 | 1 | 1 | 5 | JFK | MIA | N619AA | AA |
| 3 | 2013 | 1 | 1 | 5 | JFK | BQN | N804JB | B6 |
| 4 | 2013 | 1 | 1 | 6 | LGA | ATL | N668DN | DL |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | 14 | JFK | DCA | NaN | 9E |
| 336772 | 2013 | 9 | 30 | 22 | LGA | SYR | NaN | 9E |
| 336773 | 2013 | 9 | 30 | 12 | LGA | BNA | N535MQ | MQ |
| 336774 | 2013 | 9 | 30 | 11 | LGA | CLE | N511MQ | MQ |
| 336775 | 2013 | 9 | 30 | 8 | LGA | RDU | N839MQ | MQ |
336776 rows × 8 columns
Imagine you want to add the full airline name to the flights2 data. You can combine the airlines and flights2 data frames with a left join:
pd.merge(flights2, airlines, how='left', on='carrier')
| year | month | day | hour | origin | dest | tailnum | carrier | name | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 5 | EWR | IAH | N14228 | UA | United Air Lines Inc. |
| 1 | 2013 | 1 | 1 | 5 | LGA | IAH | N24211 | UA | United Air Lines Inc. |
| 2 | 2013 | 1 | 1 | 5 | JFK | MIA | N619AA | AA | American Airlines Inc. |
| 3 | 2013 | 1 | 1 | 5 | JFK | BQN | N804JB | B6 | JetBlue Airways |
| 4 | 2013 | 1 | 1 | 6 | LGA | ATL | N668DN | DL | Delta Air Lines Inc. |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | 14 | JFK | DCA | NaN | 9E | Endeavor Air Inc. |
| 336772 | 2013 | 9 | 30 | 22 | LGA | SYR | NaN | 9E | Endeavor Air Inc. |
| 336773 | 2013 | 9 | 30 | 12 | LGA | BNA | N535MQ | MQ | Envoy Air |
| 336774 | 2013 | 9 | 30 | 11 | LGA | CLE | N511MQ | MQ | Envoy Air |
| 336775 | 2013 | 9 | 30 | 8 | LGA | RDU | N839MQ | MQ | Envoy Air |
336776 rows × 9 columns
# This gives you the same results
flights2.merge(airlines, how='left', on='carrier')
| year | month | day | hour | origin | dest | tailnum | carrier | name | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 5 | EWR | IAH | N14228 | UA | United Air Lines Inc. |
| 1 | 2013 | 1 | 1 | 5 | LGA | IAH | N24211 | UA | United Air Lines Inc. |
| 2 | 2013 | 1 | 1 | 5 | JFK | MIA | N619AA | AA | American Airlines Inc. |
| 3 | 2013 | 1 | 1 | 5 | JFK | BQN | N804JB | B6 | JetBlue Airways |
| 4 | 2013 | 1 | 1 | 6 | LGA | ATL | N668DN | DL | Delta Air Lines Inc. |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | 14 | JFK | DCA | NaN | 9E | Endeavor Air Inc. |
| 336772 | 2013 | 9 | 30 | 22 | LGA | SYR | NaN | 9E | Endeavor Air Inc. |
| 336773 | 2013 | 9 | 30 | 12 | LGA | BNA | N535MQ | MQ | Envoy Air |
| 336774 | 2013 | 9 | 30 | 11 | LGA | CLE | N511MQ | MQ | Envoy Air |
| 336775 | 2013 | 9 | 30 | 8 | LGA | RDU | N839MQ | MQ | Envoy Air |
336776 rows × 9 columns
Sometimes, we need to join on multiple columns.
# For each flight, show the weather of the day as well
pd.merge(flights2, weather, how='left', on=["year", "month", "day", "hour", "origin"])
| year | month | day | hour | origin | dest | tailnum | carrier | temp | dewp | humid | wind_dir | wind_speed | wind_gust | precip | pressure | visib | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 5 | EWR | IAH | N14228 | UA | 39.02 | 28.04 | 64.43 | 260.0 | 12.65858 | NaN | 0.0 | 1011.9 | 10.0 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 5 | LGA | IAH | N24211 | UA | 39.92 | 24.98 | 54.81 | 250.0 | 14.96014 | 21.86482 | 0.0 | 1011.4 | 10.0 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 5 | JFK | MIA | N619AA | AA | 39.02 | 26.96 | 61.63 | 260.0 | 14.96014 | NaN | 0.0 | 1012.1 | 10.0 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 5 | JFK | BQN | N804JB | B6 | 39.02 | 26.96 | 61.63 | 260.0 | 14.96014 | NaN | 0.0 | 1012.1 | 10.0 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 6 | LGA | ATL | N668DN | DL | 39.92 | 24.98 | 54.81 | 260.0 | 16.11092 | 23.01560 | 0.0 | 1011.7 | 10.0 | 2013-01-01T11:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | 14 | JFK | DCA | NaN | 9E | 68.00 | 55.04 | 63.21 | 190.0 | 11.50780 | NaN | 0.0 | 1016.6 | 10.0 | 2013-09-30T18:00:00Z |
| 336772 | 2013 | 9 | 30 | 22 | LGA | SYR | NaN | 9E | 64.94 | 53.06 | 65.37 | 200.0 | 6.90468 | NaN | 0.0 | 1015.8 | 10.0 | 2013-10-01T02:00:00Z |
| 336773 | 2013 | 9 | 30 | 12 | LGA | BNA | N535MQ | MQ | 69.08 | 48.02 | 46.99 | 70.0 | 5.75390 | NaN | 0.0 | 1016.7 | 10.0 | 2013-09-30T16:00:00Z |
| 336774 | 2013 | 9 | 30 | 11 | LGA | CLE | N511MQ | MQ | 66.92 | 48.92 | 52.35 | 70.0 | 8.05546 | NaN | 0.0 | 1017.5 | 10.0 | 2013-09-30T15:00:00Z |
| 336775 | 2013 | 9 | 30 | 8 | LGA | RDU | N839MQ | MQ | 60.98 | 51.08 | 69.86 | NaN | 5.75390 | NaN | 0.0 | 1018.6 | 10.0 | 2013-09-30T12:00:00Z |
336776 rows × 18 columns
# The default, on = None, uses all matching variables that appear in both tables, the so called 'natural join'.
pd.merge(flights2, weather, how='left')
| year | month | day | hour | origin | dest | tailnum | carrier | temp | dewp | humid | wind_dir | wind_speed | wind_gust | precip | pressure | visib | time_hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 5 | EWR | IAH | N14228 | UA | 39.02 | 28.04 | 64.43 | 260.0 | 12.65858 | NaN | 0.0 | 1011.9 | 10.0 | 2013-01-01T10:00:00Z |
| 1 | 2013 | 1 | 1 | 5 | LGA | IAH | N24211 | UA | 39.92 | 24.98 | 54.81 | 250.0 | 14.96014 | 21.86482 | 0.0 | 1011.4 | 10.0 | 2013-01-01T10:00:00Z |
| 2 | 2013 | 1 | 1 | 5 | JFK | MIA | N619AA | AA | 39.02 | 26.96 | 61.63 | 260.0 | 14.96014 | NaN | 0.0 | 1012.1 | 10.0 | 2013-01-01T10:00:00Z |
| 3 | 2013 | 1 | 1 | 5 | JFK | BQN | N804JB | B6 | 39.02 | 26.96 | 61.63 | 260.0 | 14.96014 | NaN | 0.0 | 1012.1 | 10.0 | 2013-01-01T10:00:00Z |
| 4 | 2013 | 1 | 1 | 6 | LGA | ATL | N668DN | DL | 39.92 | 24.98 | 54.81 | 260.0 | 16.11092 | 23.01560 | 0.0 | 1011.7 | 10.0 | 2013-01-01T11:00:00Z |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | 14 | JFK | DCA | NaN | 9E | 68.00 | 55.04 | 63.21 | 190.0 | 11.50780 | NaN | 0.0 | 1016.6 | 10.0 | 2013-09-30T18:00:00Z |
| 336772 | 2013 | 9 | 30 | 22 | LGA | SYR | NaN | 9E | 64.94 | 53.06 | 65.37 | 200.0 | 6.90468 | NaN | 0.0 | 1015.8 | 10.0 | 2013-10-01T02:00:00Z |
| 336773 | 2013 | 9 | 30 | 12 | LGA | BNA | N535MQ | MQ | 69.08 | 48.02 | 46.99 | 70.0 | 5.75390 | NaN | 0.0 | 1016.7 | 10.0 | 2013-09-30T16:00:00Z |
| 336774 | 2013 | 9 | 30 | 11 | LGA | CLE | N511MQ | MQ | 66.92 | 48.92 | 52.35 | 70.0 | 8.05546 | NaN | 0.0 | 1017.5 | 10.0 | 2013-09-30T15:00:00Z |
| 336775 | 2013 | 9 | 30 | 8 | LGA | RDU | N839MQ | MQ | 60.98 | 51.08 | 69.86 | NaN | 5.75390 | NaN | 0.0 | 1018.6 | 10.0 | 2013-09-30T12:00:00Z |
336776 rows × 18 columns
Sometimes, the column names from the two dataframes may not match. Then, you need to explicitly specify the columns from each side.
# We want to combine the flights data with the airports data (key: 'faa').
# For origin:
pd.merge(flights2, airports, how='left', left_on='origin', right_on='faa')
| year | month | day | hour | origin | dest | tailnum | carrier | faa | name | lat | lon | alt | tz | dst | tzone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 5 | EWR | IAH | N14228 | UA | EWR | Newark Liberty Intl | 40.692500 | -74.168667 | 18 | -5 | A | America/New_York |
| 1 | 2013 | 1 | 1 | 5 | LGA | IAH | N24211 | UA | LGA | La Guardia | 40.777245 | -73.872608 | 22 | -5 | A | America/New_York |
| 2 | 2013 | 1 | 1 | 5 | JFK | MIA | N619AA | AA | JFK | John F Kennedy Intl | 40.639751 | -73.778925 | 13 | -5 | A | America/New_York |
| 3 | 2013 | 1 | 1 | 5 | JFK | BQN | N804JB | B6 | JFK | John F Kennedy Intl | 40.639751 | -73.778925 | 13 | -5 | A | America/New_York |
| 4 | 2013 | 1 | 1 | 6 | LGA | ATL | N668DN | DL | LGA | La Guardia | 40.777245 | -73.872608 | 22 | -5 | A | America/New_York |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | 14 | JFK | DCA | NaN | 9E | JFK | John F Kennedy Intl | 40.639751 | -73.778925 | 13 | -5 | A | America/New_York |
| 336772 | 2013 | 9 | 30 | 22 | LGA | SYR | NaN | 9E | LGA | La Guardia | 40.777245 | -73.872608 | 22 | -5 | A | America/New_York |
| 336773 | 2013 | 9 | 30 | 12 | LGA | BNA | N535MQ | MQ | LGA | La Guardia | 40.777245 | -73.872608 | 22 | -5 | A | America/New_York |
| 336774 | 2013 | 9 | 30 | 11 | LGA | CLE | N511MQ | MQ | LGA | La Guardia | 40.777245 | -73.872608 | 22 | -5 | A | America/New_York |
| 336775 | 2013 | 9 | 30 | 8 | LGA | RDU | N839MQ | MQ | LGA | La Guardia | 40.777245 | -73.872608 | 22 | -5 | A | America/New_York |
336776 rows × 16 columns
# For destination:
pd.merge(flights2, airports, how='left', left_on='dest', right_on='faa')
| year | month | day | hour | origin | dest | tailnum | carrier | faa | name | lat | lon | alt | tz | dst | tzone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 5 | EWR | IAH | N14228 | UA | IAH | George Bush Intercontinental | 29.984433 | -95.341442 | 97.0 | -6.0 | A | America/Chicago |
| 1 | 2013 | 1 | 1 | 5 | LGA | IAH | N24211 | UA | IAH | George Bush Intercontinental | 29.984433 | -95.341442 | 97.0 | -6.0 | A | America/Chicago |
| 2 | 2013 | 1 | 1 | 5 | JFK | MIA | N619AA | AA | MIA | Miami Intl | 25.793250 | -80.290556 | 8.0 | -5.0 | A | America/New_York |
| 3 | 2013 | 1 | 1 | 5 | JFK | BQN | N804JB | B6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2013 | 1 | 1 | 6 | LGA | ATL | N668DN | DL | ATL | Hartsfield Jackson Atlanta Intl | 33.636719 | -84.428067 | 1026.0 | -5.0 | A | America/New_York |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | 14 | JFK | DCA | NaN | 9E | DCA | Ronald Reagan Washington Natl | 38.852083 | -77.037722 | 15.0 | -5.0 | A | America/New_York |
| 336772 | 2013 | 9 | 30 | 22 | LGA | SYR | NaN | 9E | SYR | Syracuse Hancock Intl | 43.111187 | -76.106311 | 421.0 | -5.0 | A | America/New_York |
| 336773 | 2013 | 9 | 30 | 12 | LGA | BNA | N535MQ | MQ | BNA | Nashville Intl | 36.124472 | -86.678194 | 599.0 | -6.0 | A | America/Chicago |
| 336774 | 2013 | 9 | 30 | 11 | LGA | CLE | N511MQ | MQ | CLE | Cleveland Hopkins Intl | 41.411689 | -81.849794 | 791.0 | -5.0 | A | America/New_York |
| 336775 | 2013 | 9 | 30 | 8 | LGA | RDU | N839MQ | MQ | RDU | Raleigh Durham Intl | 35.877639 | -78.787472 | 435.0 | -5.0 | A | America/New_York |
336776 rows × 16 columns
# Compute the average arrival delay by destination,
# then join with the airports dataframe.
(
flights
.groupby('dest')
.agg({'arr_delay':'mean'})
.reset_index()
.merge(airports, left_on='dest', right_on='faa')
)
| dest | arr_delay | faa | name | lat | lon | alt | tz | dst | tzone | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABQ | 4.381890 | ABQ | Albuquerque International Sunport | 35.040222 | -106.609194 | 5355 | -7 | A | America/Denver |
| 1 | ACK | 4.852273 | ACK | Nantucket Mem | 41.253053 | -70.060181 | 48 | -5 | A | America/New_York |
| 2 | ALB | 14.397129 | ALB | Albany Intl | 42.748267 | -73.801692 | 285 | -5 | A | America/New_York |
| 3 | ANC | -2.500000 | ANC | Ted Stevens Anchorage Intl | 61.174361 | -149.996361 | 152 | -9 | A | America/Anchorage |
| 4 | ATL | 11.300113 | ATL | Hartsfield Jackson Atlanta Intl | 33.636719 | -84.428067 | 1026 | -5 | A | America/New_York |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 96 | TPA | 7.408525 | TPA | Tampa Intl | 27.975472 | -82.533250 | 26 | -5 | A | America/New_York |
| 97 | TUL | 33.659864 | TUL | Tulsa Intl | 36.198389 | -95.888111 | 677 | -6 | A | America/Chicago |
| 98 | TVC | 12.968421 | TVC | Cherry Capital Airport | 44.741445 | -85.582235 | 624 | -5 | A | America/New_York |
| 99 | TYS | 24.069204 | TYS | Mc Ghee Tyson | 35.810972 | -83.994028 | 981 | -5 | A | America/New_York |
| 100 | XNA | 7.465726 | XNA | NW Arkansas Regional | 36.281869 | -94.306811 | 1287 | -6 | A | America/Chicago |
101 rows × 10 columns
# Add the fullname of the origin and destination airports to flights.
(
flights2
.merge(airports, how='left', left_on='origin', right_on='faa')
.drop(columns=['faa','lat','lon','alt','tz','dst','tzone'])
.merge(airports, how='left', left_on='dest', right_on='faa')
.drop(columns=['faa','lat','lon','alt','tz','dst','tzone'])
.rename(columns = {'name_x':'origin_name','name_y':'destination_name'})
)
| year | month | day | hour | origin | dest | tailnum | carrier | origin_name | destination_name | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1 | 1 | 5 | EWR | IAH | N14228 | UA | Newark Liberty Intl | George Bush Intercontinental |
| 1 | 2013 | 1 | 1 | 5 | LGA | IAH | N24211 | UA | La Guardia | George Bush Intercontinental |
| 2 | 2013 | 1 | 1 | 5 | JFK | MIA | N619AA | AA | John F Kennedy Intl | Miami Intl |
| 3 | 2013 | 1 | 1 | 5 | JFK | BQN | N804JB | B6 | John F Kennedy Intl | NaN |
| 4 | 2013 | 1 | 1 | 6 | LGA | ATL | N668DN | DL | La Guardia | Hartsfield Jackson Atlanta Intl |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336771 | 2013 | 9 | 30 | 14 | JFK | DCA | NaN | 9E | John F Kennedy Intl | Ronald Reagan Washington Natl |
| 336772 | 2013 | 9 | 30 | 22 | LGA | SYR | NaN | 9E | La Guardia | Syracuse Hancock Intl |
| 336773 | 2013 | 9 | 30 | 12 | LGA | BNA | N535MQ | MQ | La Guardia | Nashville Intl |
| 336774 | 2013 | 9 | 30 | 11 | LGA | CLE | N511MQ | MQ | La Guardia | Cleveland Hopkins Intl |
| 336775 | 2013 | 9 | 30 | 8 | LGA | RDU | N839MQ | MQ | La Guardia | Raleigh Durham Intl |
336776 rows × 10 columns
# What weather conditions make it more likely to cause a depature delay?
(
flights[['year','month','day','hour','origin','dep_delay']]
.merge(weather)
[['dep_delay', 'temp', 'dewp','humid', 'wind_dir'
, 'wind_speed', 'wind_gust', 'precip', 'pressure','visib']]
.corr()
)
| dep_delay | temp | dewp | humid | wind_dir | wind_speed | wind_gust | precip | pressure | visib | |
|---|---|---|---|---|---|---|---|---|---|---|
| dep_delay | 1.000000 | 0.061491 | 0.102353 | 0.117494 | -0.017562 | 0.047424 | 0.041326 | 0.090400 | -0.114237 | -0.094118 |
| temp | 0.061491 | 1.000000 | 0.882265 | 0.035520 | -0.099931 | -0.146848 | -0.330167 | 0.010341 | -0.245974 | 0.102372 |
| dewp | 0.102353 | 0.882265 | 1.000000 | 0.492246 | -0.235978 | -0.221230 | -0.277960 | 0.100632 | -0.278175 | -0.128334 |
| humid | 0.117494 | 0.035520 | 0.492246 | 1.000000 | -0.324957 | -0.192249 | 0.053327 | 0.236107 | -0.159277 | -0.548684 |
| wind_dir | -0.017562 | -0.099931 | -0.235978 | -0.324957 | 1.000000 | 0.341566 | 0.069168 | -0.068000 | -0.209285 | 0.208869 |
| wind_speed | 0.047424 | -0.146848 | -0.221230 | -0.192249 | 0.341566 | 1.000000 | 0.883032 | 0.037896 | -0.215353 | 0.058108 |
| wind_gust | 0.041326 | -0.330167 | -0.277960 | 0.053327 | 0.069168 | 0.883032 | 1.000000 | 0.070128 | -0.240627 | -0.135805 |
| precip | 0.090400 | 0.010341 | 0.100632 | 0.236107 | -0.068000 | 0.037896 | 0.070128 | 1.000000 | -0.088148 | -0.316219 |
| pressure | -0.114237 | -0.245974 | -0.278175 | -0.159277 | -0.209285 | -0.215353 | -0.240627 | -0.088148 | 1.000000 | 0.107774 |
| visib | -0.094118 | 0.102372 | -0.128334 | -0.548684 | 0.208869 | 0.058108 | -0.135805 | -0.316219 | 0.107774 | 1.000000 |
In the next few assignments, you will be working with this data set of IMDB top 1000 movies.
Source: https://www.kaggle.com/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows
import pandas as pd
import numpy as np
# Read the data file "imdb_top_1000.csv" to a dataframe named "imdb"
imdb = pd.read_csv('../data/imdb_top_1000.csv', header=0)
imdb.head()
| Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28,341,469 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134,966,411 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534,858,444 |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57,300,000 |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4,360,000 |
# Number of rows?
# Number of columns?
imdb.shape
(1000, 16)
# Describe the dataframe using the info() method.
imdb.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1000 entries, 0 to 999 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Poster_Link 1000 non-null object 1 Series_Title 1000 non-null object 2 Released_Year 1000 non-null object 3 Certificate 899 non-null object 4 Runtime 1000 non-null object 5 Genre 1000 non-null object 6 IMDB_Rating 1000 non-null float64 7 Overview 1000 non-null object 8 Meta_score 843 non-null float64 9 Director 1000 non-null object 10 Star1 1000 non-null object 11 Star2 1000 non-null object 12 Star3 1000 non-null object 13 Star4 1000 non-null object 14 No_of_Votes 1000 non-null int64 15 Gross 831 non-null object dtypes: float64(2), int64(1), object(13) memory usage: 125.1+ KB
# Use describe() to summarize the descriptive statistics of *all* columns
imdb.describe(include='all')
| Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000 | 1000 | 1000 | 899 | 1000 | 1000 | 1000.000000 | 1000 | 843.000000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1.000000e+03 | 831 |
| unique | 1000 | 999 | 100 | 16 | 140 | 202 | NaN | 1000 | NaN | 548 | 660 | 841 | 891 | 939 | NaN | 823 |
| top | https://m.media-amazon.com/images/M/MV5BMDFkYT... | Drishyam | 2014 | U | 100 min | Drama | NaN | Two imprisoned men bond over a number of years... | NaN | Alfred Hitchcock | Tom Hanks | Emma Watson | Rupert Grint | Michael Caine | NaN | 4,360,000 |
| freq | 1 | 2 | 32 | 234 | 23 | 85 | NaN | 1 | NaN | 14 | 12 | 7 | 5 | 4 | NaN | 5 |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | 7.949300 | NaN | 77.971530 | NaN | NaN | NaN | NaN | NaN | 2.736929e+05 | NaN |
| std | NaN | NaN | NaN | NaN | NaN | NaN | 0.275491 | NaN | 12.376099 | NaN | NaN | NaN | NaN | NaN | 3.273727e+05 | NaN |
| min | NaN | NaN | NaN | NaN | NaN | NaN | 7.600000 | NaN | 28.000000 | NaN | NaN | NaN | NaN | NaN | 2.508800e+04 | NaN |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | 7.700000 | NaN | 70.000000 | NaN | NaN | NaN | NaN | NaN | 5.552625e+04 | NaN |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | 7.900000 | NaN | 79.000000 | NaN | NaN | NaN | NaN | NaN | 1.385485e+05 | NaN |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | 8.100000 | NaN | 87.000000 | NaN | NaN | NaN | NaN | NaN | 3.741612e+05 | NaN |
| max | NaN | NaN | NaN | NaN | NaN | NaN | 9.300000 | NaN | 100.000000 | NaN | NaN | NaN | NaN | NaN | 2.343110e+06 | NaN |
# List all the column names:
imdb.columns
Index(['Poster_Link', 'Series_Title', 'Released_Year', 'Certificate',
'Runtime', 'Genre', 'IMDB_Rating', 'Overview', 'Meta_score', 'Director',
'Star1', 'Star2', 'Star3', 'Star4', 'No_of_Votes', 'Gross'],
dtype='object')
# In this dataset, there is a movie with an error in "Released_Year". (Hint: Released_Year should be a 4-digit integer.)
# Find this movie.
imdb[imdb['Released_Year'].str.isdigit()==False]
| Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 966 | https://m.media-amazon.com/images/M/MV5BNjEzYj... | Apollo 13 | PG | U | 140 min | Adventure, Drama, History | 7.6 | NASA must devise a strategy to return Apollo 1... | 77.0 | Ron Howard | Tom Hanks | Bill Paxton | Kevin Bacon | Gary Sinise | 269197 | 173,837,933 |
imdb['Released_Year'].unique()
array(['1994', '1972', '2008', '1974', '1957', '2003', '1993', '2010',
'1999', '2001', '1966', '2002', '1990', '1980', '1975', '2020',
'2019', '2014', '1998', '1997', '1995', '1991', '1977', '1962',
'1954', '1946', '2011', '2006', '2000', '1988', '1985', '1968',
'1960', '1942', '1936', '1931', '2018', '2017', '2016', '2012',
'2009', '2007', '1984', '1981', '1979', '1971', '1963', '1964',
'1950', '1940', '2013', '2005', '2004', '1992', '1987', '1986',
'1983', '1976', '1973', '1965', '1959', '1958', '1952', '1948',
'1944', '1941', '1927', '1921', '2015', '1996', '1989', '1978',
'1961', '1955', '1953', '1925', '1924', '1982', '1967', '1951',
'1949', '1939', '1937', '1934', '1928', '1926', '1920', '1970',
'1969', '1956', '1947', '1945', '1930', '1938', '1935', '1933',
'1932', '1922', '1943', 'PG'], dtype=object)
# Correct the values for the corresponding columns ("Release_Year" and "Certificate").
# You may want to look up this movie on www.imdb.com.
# Hint: You can set value for a particular set by: loc[row_name, column_name] = new_value
imdb.loc[966,'Released_Year'] = 1995
imdb.loc[966,'Certificate'] = 'PG'
imdb.loc[966]
Poster_Link https://m.media-amazon.com/images/M/MV5BNjEzYj... Series_Title Apollo 13 Released_Year 1995 Certificate PG Runtime 140 min Genre Adventure, Drama, History IMDB_Rating 7.6 Overview NASA must devise a strategy to return Apollo 1... Meta_score 77.0 Director Ron Howard Star1 Tom Hanks Star2 Bill Paxton Star3 Kevin Bacon Star4 Gary Sinise No_of_Votes 269197 Gross 173,837,933 Name: 966, dtype: object
# Change "Released_Year" from string to int
imdb['Released_Year'] = imdb['Released_Year'].apply(int)
imdb['Released_Year'].dtype
dtype('int64')
# Create a new dataframe called "stars" including the following columns:
# Series_Title, Released_Year, Star1, Star2, Star3, Star4
stars = imdb[['Series_Title', 'Released_Year', 'Star1', 'Star2', 'Star3', 'Star4']]
stars
| Series_Title | Released_Year | Star1 | Star2 | Star3 | Star4 | |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler |
| 1 | The Godfather | 1972 | Marlon Brando | Al Pacino | James Caan | Diane Keaton |
| 2 | The Dark Knight | 2008 | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine |
| 3 | The Godfather: Part II | 1974 | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton |
| 4 | 12 Angry Men | 1957 | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | Breakfast at Tiffany's | 1961 | Audrey Hepburn | George Peppard | Patricia Neal | Buddy Ebsen |
| 996 | Giant | 1956 | Elizabeth Taylor | Rock Hudson | James Dean | Carroll Baker |
| 997 | From Here to Eternity | 1953 | Burt Lancaster | Montgomery Clift | Deborah Kerr | Donna Reed |
| 998 | Lifeboat | 1944 | Tallulah Bankhead | John Hodiak | Walter Slezak | William Bendix |
| 999 | The 39 Steps | 1935 | Robert Donat | Madeleine Carroll | Lucie Mannheim | Godfrey Tearle |
1000 rows × 6 columns
# Create a new dataframe called "genres" including the following columns:
# Series_Title, Released_Year, Genre.
genres = imdb[['Series_Title', 'Released_Year', 'Genre']]
genres
| Series_Title | Released_Year | Genre | |
|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Drama |
| 1 | The Godfather | 1972 | Crime, Drama |
| 2 | The Dark Knight | 2008 | Action, Crime, Drama |
| 3 | The Godfather: Part II | 1974 | Crime, Drama |
| 4 | 12 Angry Men | 1957 | Crime, Drama |
| ... | ... | ... | ... |
| 995 | Breakfast at Tiffany's | 1961 | Comedy, Drama, Romance |
| 996 | Giant | 1956 | Drama, Western |
| 997 | From Here to Eternity | 1953 | Drama, Romance, War |
| 998 | Lifeboat | 1944 | Drama, War |
| 999 | The 39 Steps | 1935 | Crime, Mystery, Thriller |
1000 rows × 3 columns
# Select all movies released after (>=) 2010 and with IMDB_Rating>=8.5
# Show their title, released year, Certificate, and gross.
# Sort them in descending order of "Gross"
(
imdb[(imdb.Released_Year>=2010) & (imdb.IMDB_Rating>=8.5)]
[['Series_Title','Released_Year','Certificate','Gross']]
.sort_values('Gross', ascending=False)
)
| Series_Title | Released_Year | Certificate | Gross | |
|---|---|---|---|---|
| 19 | Gisaengchung | 2019 | A | 53,367,844 |
| 33 | Joker | 2019 | A | 335,451,311 |
| 8 | Inception | 2010 | UA | 292,576,195 |
| 21 | Interstellar | 2014 | UA | 188,020,017 |
| 35 | The Intouchables | 2011 | UA | 13,182,281 |
| 34 | Whiplash | 2014 | A | 13,092,000 |
| 18 | Hamilton | 2020 | PG-13 | NaN |
| 20 | Soorarai Pottru | 2020 | U | NaN |
# Does the sorting result looks right to you? What's the problem?
# Answer: Gross is a string type.
# Resolve this problem of "Gross" and convert its data type to float
# Hint: You may find this webpage useful:
# https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-column-of-a-pandas-dataframe
imdb['Gross'] = imdb['Gross'].apply(str).str.replace(',','').apply(float)
imdb['Gross']
0 28341469.0
1 134966411.0
2 534858444.0
3 57300000.0
4 4360000.0
...
995 NaN
996 NaN
997 30500000.0
998 NaN
999 NaN
Name: Gross, Length: 1000, dtype: float64
# Next, redo the sorting on Gross
# Select all movies released after (>=) 2010 and with IMDB_Rating>=8.5
# Show their title, released year, Certificate, and gross.
# Sort them in descending order of "Gross"
(
imdb[(imdb.Released_Year>=2010) & (imdb.IMDB_Rating>=8.5)]
[['Series_Title','Released_Year','Certificate','Gross']]
.sort_values('Gross', ascending=False)
)
| Series_Title | Released_Year | Certificate | Gross | |
|---|---|---|---|---|
| 33 | Joker | 2019 | A | 335451311.0 |
| 8 | Inception | 2010 | UA | 292576195.0 |
| 21 | Interstellar | 2014 | UA | 188020017.0 |
| 19 | Gisaengchung | 2019 | A | 53367844.0 |
| 35 | The Intouchables | 2011 | UA | 13182281.0 |
| 34 | Whiplash | 2014 | A | 13092000.0 |
| 18 | Hamilton | 2020 | PG-13 | NaN |
| 20 | Soorarai Pottru | 2020 | U | NaN |
# Add a new column "Runtime_min" by removing the string ' min" in "Runtime"
# Set its data type as int
imdb['Runtime_min'] = imdb['Runtime'].str[:-4].astype(int)
imdb['Runtime_min']
0 142
1 175
2 152
3 202
4 96
...
995 115
996 201
997 118
998 97
999 86
Name: Runtime_min, Length: 1000, dtype: int32
# Add a new column "Decade" with values as 1980, 1990, 2000, 2010, 2020, etc.
imdb['Decade'] = imdb['Released_Year'] // 10 * 10
imdb['Decade']
0 1990
1 1970
2 2000
3 1970
4 1950
...
995 1960
996 1950
997 1950
998 1940
999 1930
Name: Decade, Length: 1000, dtype: int64
print(imdb.columns)
imdb.head()
Index(['Poster_Link', 'Series_Title', 'Released_Year', 'Certificate',
'Runtime', 'Genre', 'IMDB_Rating', 'Overview', 'Meta_score', 'Director',
'Star1', 'Star2', 'Star3', 'Star4', 'No_of_Votes', 'Gross',
'Runtime_min', 'Decade'],
dtype='object')
| Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross | Runtime_min | Decade | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469.0 | 142 | 1990 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411.0 | 175 | 1970 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444.0 | 152 | 2000 |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57300000.0 | 202 | 1970 |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4360000.0 | 96 | 1950 |
print(stars.columns)
stars.head()
Index(['Series_Title', 'Released_Year', 'Star1', 'Star2', 'Star3', 'Star4'], dtype='object')
| Series_Title | Released_Year | Star1 | Star2 | Star3 | Star4 | |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler |
| 1 | The Godfather | 1972 | Marlon Brando | Al Pacino | James Caan | Diane Keaton |
| 2 | The Dark Knight | 2008 | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine |
| 3 | The Godfather: Part II | 1974 | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton |
| 4 | 12 Angry Men | 1957 | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler |
print(genres.columns)
genres.head()
Index(['Series_Title', 'Released_Year', 'Genre'], dtype='object')
| Series_Title | Released_Year | Genre | |
|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Drama |
| 1 | The Godfather | 1972 | Crime, Drama |
| 2 | The Dark Knight | 2008 | Action, Crime, Drama |
| 3 | The Godfather: Part II | 1974 | Crime, Drama |
| 4 | 12 Angry Men | 1957 | Crime, Drama |
Follow the instructions below and write your code to answer the questions:
To better understand the advantages of tidy data, you will first use the "un-tidy" dataframes alone to answer the next few questions:
# In dataframe "stars", find all movies that star "Morgan Freeman".
# Hint: he could be Star1, Star2, Star3, or Star4.
(
stars.loc[(stars['Star1']=='Morgan Freeman')
|(stars['Star2']=='Morgan Freeman')
|(stars['Star3']=='Morgan Freeman')
|(stars['Star4']=='Morgan Freeman')]
)
| Series_Title | Released_Year | Star1 | Star2 | Star3 | Star4 | |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler |
| 27 | Se7en | 1995 | Morgan Freeman | Brad Pitt | Kevin Spacey | Andrew Kevin Walker |
| 167 | Unforgiven | 1992 | Clint Eastwood | Gene Hackman | Morgan Freeman | Richard Harris |
| 234 | Million Dollar Baby | 2004 | Hilary Swank | Clint Eastwood | Morgan Freeman | Jay Baruchel |
| 673 | Glory | 1989 | Matthew Broderick | Denzel Washington | Cary Elwes | Morgan Freeman |
| 768 | Lucky Number Slevin | 2006 | Josh Hartnett | Ben Kingsley | Morgan Freeman | Lucy Liu |
| 922 | Gone Baby Gone | 2007 | Morgan Freeman | Ed Harris | Casey Affleck | Michelle Monaghan |
# In dataframe 'stars', who appeared in Star2 the most times? List the top five actors.
stars['Star2'].value_counts().head(5)
Emma Watson 7 Matt Damon 5 Kate Winslet 4 Ian McKellen 4 Chris Evans 4 Name: Star2, dtype: int64
# In dataframe 'imdb', find all comedies and list Series_Title, Released_Year, Director, and IMDB_Rating
# Sort them by Released_Year in desceding order.
# Hint: use .str.contains(...)
(
imdb
.loc[imdb['Genre'].str.contains('Comedy')
,['Series_Title','Released_Year','Director','IMDB_Rating']]
.sort_values('Released_Year',ascending=False)
)
| Series_Title | Released_Year | Director | IMDB_Rating | |
|---|---|---|---|---|
| 464 | Dil Bechara | 2020 | Mukesh Chhabra | 7.9 |
| 613 | Druk | 2020 | Thomas Vinterberg | 7.8 |
| 205 | Soul | 2020 | Pete Docter | 8.1 |
| 466 | Marriage Story | 2019 | Noah Baumbach | 7.9 |
| 884 | The Peanut Butter Falcon | 2019 | Tyler Nilson | 7.6 |
| ... | ... | ... | ... | ... |
| 318 | The Circus | 1928 | Charles Chaplin | 8.1 |
| 320 | The General | 1926 | Clyde Bruckman | 8.1 |
| 193 | The Gold Rush | 1925 | Charles Chaplin | 8.2 |
| 194 | Sherlock Jr. | 1924 | Buster Keaton | 8.2 |
| 127 | The Kid | 1921 | Charles Chaplin | 8.3 |
233 rows × 4 columns
# In dataframe 'imdb', find all values in the Genre column and the number of occurrences for each value.
# Hint: use value_counts()
imdb['Genre'].value_counts()
Drama 85
Drama, Romance 37
Comedy, Drama 35
Comedy, Drama, Romance 31
Action, Crime, Drama 30
..
Adventure, Thriller 1
Animation, Action, Sci-Fi 1
Action, Crime, Comedy 1
Animation, Crime, Mystery 1
Adventure, Comedy, War 1
Name: Genre, Length: 202, dtype: int64
Next, you will further tidy the two dataframes stars and genres.
Let's start with stars.
# Tranform the dataframe "stars" to a new dataframe named "stars_long" with the following four columns:
# Series_Title, Released_Year, StarNo (e.g., Star1, Star2, ...), StarName
# Hint: use melt()
stars_long = stars.melt(id_vars=['Series_Title','Released_Year'],
value_vars=['Star1','Star2','Star3','Star4'],
var_name='StarNo',
value_name='StarName')
stars_long
| Series_Title | Released_Year | StarNo | StarName | |
|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Star1 | Tim Robbins |
| 1 | The Godfather | 1972 | Star1 | Marlon Brando |
| 2 | The Dark Knight | 2008 | Star1 | Christian Bale |
| 3 | The Godfather: Part II | 1974 | Star1 | Al Pacino |
| 4 | 12 Angry Men | 1957 | Star1 | Henry Fonda |
| ... | ... | ... | ... | ... |
| 3995 | Breakfast at Tiffany's | 1961 | Star4 | Buddy Ebsen |
| 3996 | Giant | 1956 | Star4 | Carroll Baker |
| 3997 | From Here to Eternity | 1953 | Star4 | Donna Reed |
| 3998 | Lifeboat | 1944 | Star4 | William Bendix |
| 3999 | The 39 Steps | 1935 | Star4 | Godfrey Tearle |
4000 rows × 4 columns
# Can you transform dataframe 'stars_long' back to its original shape?
# Hint: use pivot()
stars_long.pivot(index=['Series_Title','Released_Year'], columns='StarNo', values='StarName').reset_index()
| StarNo | Series_Title | Released_Year | Star1 | Star2 | Star3 | Star4 |
|---|---|---|---|---|---|---|
| 0 | (500) Days of Summer | 2009 | Zooey Deschanel | Joseph Gordon-Levitt | Geoffrey Arend | Chloë Grace Moretz |
| 1 | 12 Angry Men | 1957 | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler |
| 2 | 12 Years a Slave | 2013 | Chiwetel Ejiofor | Michael Kenneth Williams | Michael Fassbender | Brad Pitt |
| 3 | 1917 | 2019 | Dean-Charles Chapman | George MacKay | Daniel Mays | Colin Firth |
| 4 | 2001: A Space Odyssey | 1968 | Keir Dullea | Gary Lockwood | William Sylvester | Daniel Richter |
| ... | ... | ... | ... | ... | ... | ... |
| 995 | Zootopia | 2016 | Rich Moore | Jared Bush | Ginnifer Goodwin | Jason Bateman |
| 996 | Zulu | 1964 | Stanley Baker | Jack Hawkins | Ulla Jacobsson | James Booth |
| 997 | Zwartboek | 2006 | Carice van Houten | Sebastian Koch | Thom Hoffman | Halina Reijn |
| 998 | À bout de souffle | 1960 | Jean-Paul Belmondo | Jean Seberg | Daniel Boulanger | Henri-Jacques Huet |
| 999 | Ôkami kodomo no Ame to Yuki | 2012 | Aoi Miyazaki | Takao Osawa | Haru Kuroki | Yukito Nishii |
1000 rows × 6 columns
# In dataframe "stars_long", find all movies that star "Morgan Freeman".
stars_long.loc[(stars_long['StarName']=='Morgan Freeman')]
| Series_Title | Released_Year | StarNo | StarName | |
|---|---|---|---|---|
| 27 | Se7en | 1995 | Star1 | Morgan Freeman |
| 922 | Gone Baby Gone | 2007 | Star1 | Morgan Freeman |
| 1000 | The Shawshank Redemption | 1994 | Star2 | Morgan Freeman |
| 2167 | Unforgiven | 1992 | Star3 | Morgan Freeman |
| 2234 | Million Dollar Baby | 2004 | Star3 | Morgan Freeman |
| 2768 | Lucky Number Slevin | 2006 | Star3 | Morgan Freeman |
| 3673 | Glory | 1989 | Star4 | Morgan Freeman |
# Who were the Star2 the most times? List the top five actors.
stars_long.loc[(stars_long['StarNo']=='Star2')]['StarName'].value_counts().head(5)
Emma Watson 7 Matt Damon 5 Kate Winslet 4 Ian McKellen 4 Chris Evans 4 Name: StarName, dtype: int64
# Who star in the most movies in this list? List the top 20 actors.
stars_long['StarName'].value_counts().head(20)
Robert De Niro 17 Tom Hanks 14 Al Pacino 13 Clint Eastwood 12 Brad Pitt 12 Leonardo DiCaprio 11 Matt Damon 11 Christian Bale 11 James Stewart 10 Ethan Hawke 9 Scarlett Johansson 9 Michael Caine 9 Johnny Depp 9 Humphrey Bogart 9 Denzel Washington 9 Aamir Khan 8 Harrison Ford 8 Edward Norton 7 Ian McKellen 7 Robert Downey Jr. 7 Name: StarName, dtype: int64
# Which movie stars had the highest total gross in this movie list? Show the top 10 actors.
# Hint: Join "stars_long" and "imdb"; then group by StarName
(
stars_long
.merge(imdb)
.groupby('StarName')
.agg({'Gross':'sum'})
.sort_values('Gross',ascending=False)
.head(10)
.reset_index()
)
| StarName | Gross | |
|---|---|---|
| 0 | Robert Downey Jr. | 3.129073e+09 |
| 1 | Tom Hanks | 2.903565e+09 |
| 2 | Chris Evans | 2.339664e+09 |
| 3 | Joe Russo | 2.205039e+09 |
| 4 | Mark Ruffalo | 2.058396e+09 |
| 5 | Leonardo DiCaprio | 2.049297e+09 |
| 6 | Ian McKellen | 1.869869e+09 |
| 7 | Rupert Grint | 1.835901e+09 |
| 8 | Daniel Radcliffe | 1.835901e+09 |
| 9 | Matt Damon | 1.728542e+09 |
# Find the best director-actor duos
# i.e., director-actor pairs that collaborated in at least five movies,
# sort them in descending order of average IMDB_Rating
# Hint: Join imdb and stars_long; group by ['Director','StarName']
(
imdb
.merge(stars_long)
.groupby(['Director','StarName'])
.agg({'Series_Title':'count','IMDB_Rating':'mean'})
.query('Series_Title>=5')
.sort_values('IMDB_Rating',ascending=False)
.reset_index()
)
| Director | StarName | Series_Title | IMDB_Rating | |
|---|---|---|---|---|
| 0 | Peter Jackson | Ian McKellen | 5 | 8.400000 |
| 1 | Charles Chaplin | Charles Chaplin | 6 | 8.333333 |
| 2 | Akira Kurosawa | Toshirô Mifune | 7 | 8.242857 |
| 3 | Martin Scorsese | Robert De Niro | 6 | 8.183333 |
| 4 | Akira Kurosawa | Tatsuya Nakadai | 5 | 8.180000 |
| 5 | Clint Eastwood | Clint Eastwood | 5 | 7.960000 |
| 6 | Richard Linklater | Ethan Hawke | 5 | 7.960000 |
| 7 | Woody Allen | Woody Allen | 5 | 7.840000 |
| 8 | Joel Coen | Ethan Coen | 6 | 7.816667 |
# Bonus question
# Who did "Amy Adams" co-star with in this movie list?
# Hint: Join stars_long with itself to find pairs of co-stars
Next, let's reshape the dataframe genres, which is a little bit more complicated.
genres
| Series_Title | Released_Year | Genre | |
|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Drama |
| 1 | The Godfather | 1972 | Crime, Drama |
| 2 | The Dark Knight | 2008 | Action, Crime, Drama |
| 3 | The Godfather: Part II | 1974 | Crime, Drama |
| 4 | 12 Angry Men | 1957 | Crime, Drama |
| ... | ... | ... | ... |
| 995 | Breakfast at Tiffany's | 1961 | Comedy, Drama, Romance |
| 996 | Giant | 1956 | Drama, Western |
| 997 | From Here to Eternity | 1953 | Drama, Romance, War |
| 998 | Lifeboat | 1944 | Drama, War |
| 999 | The 39 Steps | 1935 | Crime, Mystery, Thriller |
1000 rows × 3 columns
# Step 1: Split the 'Genre' string by ', ' into a list of individual genres and expand them to different columns
genres_split = genres.Genre.str.split(', ', expand=True)
genres_split
# Step 2: Rename the columns as: Genre1, Genre2, Genre3
genres_split.columns = ['Genre1', 'Genre2', 'Genre3']
genres_split
| Genre1 | Genre2 | Genre3 | |
|---|---|---|---|
| 0 | Drama | None | None |
| 1 | Crime | Drama | None |
| 2 | Action | Crime | Drama |
| 3 | Crime | Drama | None |
| 4 | Crime | Drama | None |
| ... | ... | ... | ... |
| 995 | Comedy | Drama | Romance |
| 996 | Drama | Western | None |
| 997 | Drama | Romance | War |
| 998 | Drama | War | None |
| 999 | Crime | Mystery | Thriller |
1000 rows × 3 columns
# Step 3: Combine ['Series_Title','Released_Year'] in 'genres' and ['Genre1','Genre2','Genre3'] in 'genres_split'.
# Save it to a new dataframe named 'genres_wide'.
# Hint: Use pd.concat(...)
genres_wide = pd.concat([genres,genres_split], axis=1).drop(columns=['Genre'])
genres_wide
| Series_Title | Released_Year | Genre1 | Genre2 | Genre3 | |
|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Drama | None | None |
| 1 | The Godfather | 1972 | Crime | Drama | None |
| 2 | The Dark Knight | 2008 | Action | Crime | Drama |
| 3 | The Godfather: Part II | 1974 | Crime | Drama | None |
| 4 | 12 Angry Men | 1957 | Crime | Drama | None |
| ... | ... | ... | ... | ... | ... |
| 995 | Breakfast at Tiffany's | 1961 | Comedy | Drama | Romance |
| 996 | Giant | 1956 | Drama | Western | None |
| 997 | From Here to Eternity | 1953 | Drama | Romance | War |
| 998 | Lifeboat | 1944 | Drama | War | None |
| 999 | The 39 Steps | 1935 | Crime | Mystery | Thriller |
1000 rows × 5 columns
# Step 4: Transform genres_wide to a new dataframe genres_long with the following four columns:
# Series_Title, Released_Year, GenreNo (e.g., Genre1, Genre2, Genre2), GenreName
# Hint: use melt()
genres_long = genres_wide.melt(id_vars=['Series_Title','Released_Year'],
value_vars=['Genre1','Genre2','Genre3'],
var_name='GenreNo',
value_name='GenreName')
genres_long
| Series_Title | Released_Year | GenreNo | GenreName | |
|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Genre1 | Drama |
| 1 | The Godfather | 1972 | Genre1 | Crime |
| 2 | The Dark Knight | 2008 | Genre1 | Action |
| 3 | The Godfather: Part II | 1974 | Genre1 | Crime |
| 4 | 12 Angry Men | 1957 | Genre1 | Crime |
| ... | ... | ... | ... | ... |
| 2995 | Breakfast at Tiffany's | 1961 | Genre3 | Romance |
| 2996 | Giant | 1956 | Genre3 | None |
| 2997 | From Here to Eternity | 1953 | Genre3 | War |
| 2998 | Lifeboat | 1944 | Genre3 | None |
| 2999 | The 39 Steps | 1935 | Genre3 | Thriller |
3000 rows × 4 columns
# How many movies are there for each genre?
genres_long['GenreName'].value_counts()
Drama 724 Comedy 233 Crime 209 Adventure 196 Action 189 Thriller 137 Romance 125 Biography 109 Mystery 99 Animation 82 Sci-Fi 67 Fantasy 66 History 56 Family 56 War 51 Music 35 Horror 32 Western 20 Sport 19 Film-Noir 19 Musical 17 Name: GenreName, dtype: int64
# How many unique genres (atomic values, e.g., Drama, Comedy, ...) are there?
genres_long['GenreName'].nunique()
21
# What is the average IMDB rating for each genre?
# Sort the genres in descending order of average IMDB_Rating.
# Hint: join imdb with genres_long; group by GenreName
(
imdb
.merge(genres_long)
.groupby('GenreName')
.agg({'IMDB_Rating':'mean'})
.sort_values('IMDB_Rating',ascending=False)
.reset_index()
)
| GenreName | IMDB_Rating | |
|---|---|---|
| 0 | War | 8.013725 |
| 1 | Western | 8.000000 |
| 2 | Film-Noir | 7.989474 |
| 3 | Sci-Fi | 7.977612 |
| 4 | Mystery | 7.967677 |
| 5 | Drama | 7.959392 |
| 6 | Crime | 7.954545 |
| 7 | History | 7.953571 |
| 8 | Adventure | 7.952041 |
| 9 | Action | 7.948677 |
| 10 | Musical | 7.947059 |
| 11 | Biography | 7.935780 |
| 12 | Fantasy | 7.931818 |
| 13 | Animation | 7.930488 |
| 14 | Sport | 7.926316 |
| 15 | Romance | 7.925600 |
| 16 | Music | 7.914286 |
| 17 | Family | 7.912500 |
| 18 | Thriller | 7.909489 |
| 19 | Comedy | 7.903433 |
| 20 | Horror | 7.887500 |
# Who is the "King of Comedy" (i.e., the actor who starred in the most comedy movies)?
# Hint: find all comedies in genres_long ; join with stars_long; group by StarName
(
genres_long
.loc[genres_long['GenreName']=='Comedy']
.merge(stars_long)
.groupby('StarName')
.agg({'Series_Title':'count'})
.sort_values('Series_Title',ascending=False)
.reset_index()
)
| StarName | Series_Title | |
|---|---|---|
| 0 | Charles Chaplin | 6 |
| 1 | Bill Murray | 6 |
| 2 | Aamir Khan | 5 |
| 3 | Woody Allen | 5 |
| 4 | Cary Grant | 5 |
| ... | ... | ... |
| 770 | Husnija Hasimovic | 1 |
| 771 | Ida Engvoll | 1 |
| 772 | Ileana D'Cruz | 1 |
| 773 | Imelda Staunton | 1 |
| 774 | Özge Özberk | 1 |
775 rows × 2 columns
Classification is one of the most useful and popular functions in data mining and machine learning.
Essentially, classificaiton is aimed at building a prediction model that can assign data points to a set of predefined classes, i.e., giving a class label to each data point.
There are several key concepts related to classification:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
We will use the famous Titanic data to illustrate the process of classificaiton.
# Load the training set
df_train = pd.read_csv('../data/titanic_train.csv')
df_train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null object 4 Sex 891 non-null object 5 Age 714 non-null float64 6 SibSp 891 non-null int64 7 Parch 891 non-null int64 8 Ticket 891 non-null object 9 Fare 891 non-null float64 10 Cabin 204 non-null object 11 Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.7+ KB
s = df_train['Survived'].value_counts()
print(s)
print(f'the survival rate is {s[1]/s.sum():.1%}')
0 549 1 342 Name: Survived, dtype: int64 the survival rate is 38.4%
s_norm = df_train['Survived'].value_counts(normalize=True)
s_norm
0 0.616162 1 0.383838 Name: Survived, dtype: float64
Simply based on the overall survival rate (38.4%), we could build a naive prediction model:
Let's apply this model to the test set.
df_test = pd.read_csv('../data/titanic_test.csv')
df_test.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 418 entries, 0 to 417 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 418 non-null int64 1 Pclass 418 non-null int64 2 Name 418 non-null object 3 Sex 418 non-null object 4 Age 332 non-null float64 5 SibSp 418 non-null int64 6 Parch 418 non-null int64 7 Ticket 418 non-null object 8 Fare 417 non-null float64 9 Cabin 91 non-null object 10 Embarked 418 non-null object dtypes: float64(2), int64(4), object(5) memory usage: 36.0+ KB
# predict survived = 0 (dead) for all because survival rate is 38.4%
df_test['Survived'] = 0
df_test.head()
| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | 0 |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S | 0 |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | 0 |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | 0 |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | 0 |
# Select the two columns and save to a .csv file for submission
df_submit_allzero = df_test[['PassengerId', 'Survived']]
# Set index=False to exclude the index column
df_submit_allzero.to_csv('../data/titanic_submit_allzero.csv', index=False)
Submit this file "titanic_submit_allzero.csv" to Kaggle. Check your score and ranking.
Can we do better than this?
# Consider Sex as a predictor variable
group = df_train.groupby('Sex')['Survived'].value_counts()
print(group.index)
group
MultiIndex([('female', 1),
('female', 0),
( 'male', 0),
( 'male', 1)],
names=['Sex', 'Survived'])
Sex Survived
female 1 233
0 81
male 0 468
1 109
Name: Survived, dtype: int64
print(f'female survival rate: {group["female", 1]/group["female"].sum():.1%}')
print(f'male survival rate: {group["male", 1]/group["male"].sum():.1%}')
female survival rate: 74.2% male survival rate: 18.9%
# another (easier) way to calculate the survival rates by gender
group_norm = df_train.groupby('Sex')['Survived'].value_counts(normalize=True)
group_norm
Sex Survived
female 1 0.742038
0 0.257962
male 0 0.811092
1 0.188908
Name: Survived, dtype: float64
Since female survival rate is 74.2% and male survival rate is 18.9%, we could build another simple prediction model:
# predict survived = 0 (dead) for male and survived = 1 (survived) for female
# because female has a higher survival rate (74.2%) than male (18.9%)
df_test['Survived_Gender'] = df_test['Sex'].apply(lambda x: 0 if x=='male' else 1)
df_test.head()
| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived | Survived_Gender | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | 0 | 0 |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S | 0 | 1 |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | 0 | 0 |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | 0 | 0 |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | 0 | 1 |
df_submit_gender = df_test[['PassengerId', 'Survived_Gender']]
# change column name to 'Survived' for Kaggle submission
df_submit_gender.to_csv('../data/titanic_submit_gender.csv', index=False, header=['PassengerId', 'Survived'])
Submit this file "titanic_submit_gender.csv" to Kaggle. Check your score and ranking.
A decision tree is a prediction model that uses a tree-like structure of decisions and their possible consequences. It can be used for classificaiton and regression.
The basic idea of a decision tree is to split data set based on the homogeneity of data, i.e., reducing “impurity”.
Entropy is one of the most common measures for calculating impurity.
$H(x)=-\sum_{i=1}^np_{i}\log p_{i}$, where $p_{i}$ is the probability of class $i$ in the data.
# Entropy for a collection of 30 balls with 15 red and 15 blue
# Entropy = 1 i.e., maximal impurity
e1 = -15/30*np.log2(15/30)-15/30*np.log2(15/30)
e1
1.0
# Entropy for a collection of 30 balls with 2 red and 28 blue
e2 = -2/30*np.log2(2/30)-28/30*np.log2(28/30)
e2
0.35335933502142136
# Entropy for a collection of 30 balls with 0 red and 30 blue
# Entropy = 0 i.e., maximal purity
e3 = -np.log2(30/30)
e3
-0.0
# Define a function to caculate entropy
# input: a list of values that represents ratio of different classes, e.g., (1,2,3,4)
def entropy(ratio):
h = None
s = 0
for i in ratio:
s += i
if s>0:
h = 0
for i in ratio:
if i>0:
h += -i/s*np.log2(i/s)
return h
r = input('Please enter the ratio in format 1:2:3 : ')
ratio = [int(i) for i in r.split(':')]
entropy(ratio)
Please enter the ratio in format 1:2:3 : 1:9
0.4689955935892812
fig = plt.figure()
ax = plt.axes()
x = np.arange(0, 31, 1) # return numbers [0,1,2,...,30]
e = pd.Series([entropy([i,30-i]) for i in x])
ax.plot(x, e)
[<matplotlib.lines.Line2D at 0x191dbd1d3a0>]
Scikit-learn is one of the mose popular Python packages for predictive data analysis.
We will use sklearn for building decision trees.
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
from sklearn.tree import DecisionTreeClassifier
df_train.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# sklearn decision tree only accept inputs in float data type without null values
# we will learn data processing and transformation later to handle null values and non-numerical features
# thus, we choose three numerical values to train the tree classifier
# pclass: Ticket class - 1 = 1st, 2 = 2nd, 3 = 3rd
# sibsp: of siblings / spouses aboard the Titanic
# fare: Passenger fare
X = df_train[['Pclass', 'SibSp', 'Fare']]
y = df_train['Survived']
print(X.shape)
print(y.shape)
(891, 3) (891,)
# Train a DT model by using all default settings.
# The default criterion='gini', not 'entropy'.
from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X, y)
DecisionTreeClassifier()
# For visulizing the tree
# This may require you installing a package called "graphviz" as well.
from IPython.display import Image
from sklearn import tree
import pydotplus
import os
os.environ['PATH'] = os.environ['PATH']+';'+os.environ['CONDA_PREFIX']+r"\Library\bin\graphviz"
# get feature and class names for visualization
print(X.columns.values.tolist())
print(y.unique().tolist())
cls_names = ['died' if i == 0 else 'survived' for i in y.unique().tolist()] # convert to string for class names
cls_names
['Pclass', 'SibSp', 'Fare'] [0, 1]
['died', 'survived']
# Create DOT data
dot_data = tree.export_graphviz(tree_clf,
feature_names=X.columns.values.tolist(),
class_names=cls_names,
)
# Draw graph
graph = pydotplus.graph_from_dot_data(dot_data)
# Show graph
Image(graph.create_png())
You can see this is a huge tree, which may lead to overfitting problem.
Let's set max_depth=3 to generate a simpler tree. Also, set criterion='entropy'.
tree_clf = DecisionTreeClassifier(criterion='entropy', max_depth=3)
tree_clf.fit(X, y)
DecisionTreeClassifier(criterion='entropy', max_depth=3)
# Create DOT data
dot_data = tree.export_graphviz(tree_clf,
feature_names=X.columns.values.tolist(),
class_names=cls_names,
)
# Draw graph
graph = pydotplus.graph_from_dot_data(dot_data)
# Show graph
Image(graph.create_png())
Remember the training data has three features ['Pclass', 'SibSp', 'Fare'], we can predict the target based on different values for those three features
# passenger 1 who bought a class 3 ticket at $8.5 with no siblings / spouses, Jack ?
# passenger 2 who bought a first class ticket at $88 with no siblings / spouses, Rose ?
passenger1 = tree_clf.predict([[3, 0, 8.5]])
passenger2 = tree_clf.predict([[1, 0, 88]])
print(passenger1, passenger2)
[0] [1]
Next, we use this simple DT to predict for the test set.
# choose the subset of the test set
# there is a null value in Fare
X_test = df_test[['Pclass', 'SibSp', 'Fare']]
X_test.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 418 entries, 0 to 417 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Pclass 418 non-null int64 1 SibSp 418 non-null int64 2 Fare 417 non-null float64 dtypes: float64(1), int64(2) memory usage: 9.9 KB
# fill the null value with mean
X_test = X_test.fillna(X_test.mean())
X_test.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 418 entries, 0 to 417 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Pclass 418 non-null int64 1 SibSp 418 non-null int64 2 Fare 418 non-null float64 dtypes: float64(1), int64(2) memory usage: 9.9 KB
y_hat = tree_clf.predict(X_test)
y_hat
array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0,
1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,
0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1,
0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0,
0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0,
1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0,
1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
dtype=int64)
# combine the final dataframe
df_submit_simpleDT = pd.DataFrame({
'PassengerId': df_test['PassengerId'],
'Survived': y_hat,
})
df_submit_simpleDT.head()
| PassengerId | Survived | |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
df_submit_simpleDT.to_csv('../data/titanic_submit_simpleDT.csv', index=False)
In this notebook, we continue working on the Titanic data to predict survival by including more predictor variables. In particular, our code will:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# load dataset
df = pd.read_csv('../data/titanic_train.csv')
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null object 4 Sex 891 non-null object 5 Age 714 non-null float64 6 SibSp 891 non-null int64 7 Parch 891 non-null int64 8 Ticket 891 non-null object 9 Fare 891 non-null float64 10 Cabin 204 non-null object 11 Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.7+ KB
df.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# PassengerId and name are not very useful for predicting
# assume we cannot get useful information for prediction from ticket number
# let's drop these columns
df.drop(['PassengerId', 'Name', 'Ticket'], axis = 1, inplace=True)
df.head()
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | NaN | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | NaN | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | C123 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | NaN | S |
There are two main types of approaches to handle missing values in data:
# total null values
df.isnull().sum()
Survived 0 Pclass 0 Sex 0 Age 177 SibSp 0 Parch 0 Fare 0 Cabin 687 Embarked 2 dtype: int64
# given that only 2 out of 891 rows have missing values for Embarked - let's drop those two rows
# drop the observations/rows with missing values using dropna()
# note - we are not dropping the Embarked column!!
df.dropna(subset=['Embarked'], inplace=True)
df.head()
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | NaN | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | NaN | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | C123 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | NaN | S |
# no missing values for Embarked
df.isnull().sum()
Survived 0 Pclass 0 Sex 0 Age 177 SibSp 0 Parch 0 Fare 0 Cabin 687 Embarked 0 dtype: int64
# given 687 out of 891 rows have missing values for Cabin feature, let's drop this feature/column
# axis defaults to 0, which means row, here we are dropping the column, therefore axis=1
df.drop('Cabin', axis=1, inplace=True)
df.isnull().sum()
Survived 0 Pclass 0 Sex 0 Age 177 SibSp 0 Parch 0 Fare 0 Embarked 0 dtype: int64
# we cannot drop Age feature, which does not have as many as missing values as Cabin
df['Age'].hist(bins=50)
<AxesSubplot:>
df['Age'].describe()
count 712.000000 mean 29.642093 std 14.492933 min 0.420000 25% 20.000000 50% 28.000000 75% 38.000000 max 80.000000 Name: Age, dtype: float64
df['Age'].head(10)
0 22.0 1 38.0 2 26.0 3 35.0 4 35.0 5 NaN 6 54.0 7 2.0 8 27.0 9 14.0 Name: Age, dtype: float64
# method 1: fill the missing values with median using fillna()
# note that I don't use inplace=True because I want to demostrate another method
median = df['Age'].median()
df_fill_median = df['Age'].fillna(median)
df_fill_median.head(10)
0 22.0 1 38.0 2 26.0 3 35.0 4 35.0 5 28.0 6 54.0 7 2.0 8 27.0 9 14.0 Name: Age, dtype: float64
# method 2: fill the missing values with median using Scikit-Learn SimpleImputer
# this approach can do imputation for all numerical attrributes all at once
from sklearn.impute import SimpleImputer
median_imputer = SimpleImputer(strategy='median')
# select only numerical attributes for Imputer
# NOTE: Technically, Pclass (Ticket class with values 1, 2, or 3) is a categorial attribute.
# we treat Pclass as a numerical attribute here to keep it simple.
df_num = df.select_dtypes(include=['int64','float64'])
# the following computes median for each attributes and store the result in statistics_ build_in variable
df_num_fill_median = median_imputer.fit_transform(df_num)
median_imputer.statistics_ # same result as df_num.median().values
array([ 0. , 3. , 28. , 0. , 0. , 14.4542])
df_num.median().values
array([ 0. , 3. , 28. , 0. , 0. , 14.4542])
# imputer returns a numpy array
df_num_fill_median
array([[ 0. , 3. , 22. , 1. , 0. , 7.25 ],
[ 1. , 1. , 38. , 1. , 0. , 71.2833],
[ 1. , 3. , 26. , 0. , 0. , 7.925 ],
...,
[ 0. , 3. , 28. , 1. , 2. , 23.45 ],
[ 1. , 1. , 26. , 0. , 0. , 30. ],
[ 0. , 3. , 32. , 0. , 0. , 7.75 ]])
# change a numpy array to a DataFrame
df_num_fill_median = pd.DataFrame(df_num_fill_median, columns=df_num.columns)
df_num_fill_median.isnull().sum() # no missing values
Survived 0 Pclass 0 Age 0 SibSp 0 Parch 0 Fare 0 dtype: int64
Save the target column and drop it from the training set. Since the target column has no missing values now and we normally don't need to encode target column (even if it has categorical values), let's save it and then drop it form the training set.
# set the target
y = df_num_fill_median['Survived']
df_num_fill_median.drop(['Survived'], axis=1, inplace=True)
df_num_fill_median.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 889 entries, 0 to 888 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Pclass 889 non-null float64 1 Age 889 non-null float64 2 SibSp 889 non-null float64 3 Parch 889 non-null float64 4 Fare 889 non-null float64 dtypes: float64(5) memory usage: 34.9 KB
So far, we have only included numerical variables in the model. Next, we will handle categorical variables.
# get categorial attributes
df_cat = df.select_dtypes(['object'])
df_cat.head()
| Sex | Embarked | |
|---|---|---|
| 0 | male | S |
| 1 | female | C |
| 2 | female | S |
| 3 | female | S |
| 4 | male | S |
df_cat.describe()
| Sex | Embarked | |
|---|---|---|
| count | 889 | 889 |
| unique | 2 | 3 |
| top | male | S |
| freq | 577 | 644 |
# encode categorical values to integers
from sklearn.preprocessing import OrdinalEncoder
cat_encoder = OrdinalEncoder()
df_cat_encoded = cat_encoder.fit_transform(df_cat)
# categories are listed in order ['female', 'male']-> 0, 1; ['C', 'Q', 'S']-> 0, 1, 2
cat_encoder.categories_
[array(['female', 'male'], dtype=object), array(['C', 'Q', 'S'], dtype=object)]
# encoder returns a numpy array
df_cat_encoded
array([[1., 2.],
[0., 0.],
[0., 2.],
...,
[0., 2.],
[1., 0.],
[1., 1.]])
# change df_cat_encoded into dataframe
df_cat_encoded = pd.DataFrame(df_cat_encoded, columns=df_cat.columns)
df_cat_encoded
| Sex | Embarked | |
|---|---|---|
| 0 | 1.0 | 2.0 |
| 1 | 0.0 | 0.0 |
| 2 | 0.0 | 2.0 |
| 3 | 0.0 | 2.0 |
| 4 | 1.0 | 2.0 |
| ... | ... | ... |
| 884 | 1.0 | 2.0 |
| 885 | 0.0 | 2.0 |
| 886 | 0.0 | 2.0 |
| 887 | 1.0 | 0.0 |
| 888 | 1.0 | 1.0 |
889 rows × 2 columns
# the same cat_encoder can be used to encode new data
new_data = [['male', 'S']]
new_data_encoded = cat_encoder.transform(new_data)
print(new_data_encoded)
[[1. 2.]]
Let's pause here and consider this:
Is this ordinal encoder appropriate for encoding these two variables? Why or why not?
# Using one-hot encoding
from sklearn.preprocessing import OneHotEncoder
onehot_encoder = OneHotEncoder()
df_cat_onehot_encoded = onehot_encoder.fit_transform(df_cat)
onehot_encoder.categories_
[array(['female', 'male'], dtype=object), array(['C', 'Q', 'S'], dtype=object)]
column_names = onehot_encoder.get_feature_names()
column_names
array(['x0_female', 'x0_male', 'x1_C', 'x1_Q', 'x1_S'], dtype=object)
# onehot encoder returns a sparse matrix and we convert that to a numpy array
df_cat_onehot_encoded = df_cat_onehot_encoded.toarray()
# change df_onehot_encoded to a Dataframe
df_cat_onehot_encoded = pd.DataFrame(df_cat_onehot_encoded, columns=column_names)
df_cat_onehot_encoded.head()
| x0_female | x0_male | x1_C | x1_Q | x1_S | |
|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
# the same onehot_encoder can be used to encode new data
new_data1 = [['male', 'S']]
new_data1_encoded = onehot_encoder.transform(new_data1)
print(new_data1_encoded.toarray())
[[0. 1. 0. 0. 1.]]
Is the onehot encoder a better choice for encoding these two categorical variables?
Now we can prepare for the final train dataset and build the model:
We first generated the following dataframes:
df: original df after dropping a few columnsdf_fill_median: df with Age filled using median via fillna(), same shape as dfThen, we split the df into numerical and categorical dataframes:
df_num: numerical columns onlydf_num_fill_median: df_num with Age filled using SimpleImputerdf_cat: categorical columns onlydf_cat_encoded: df_cat encoded using Ordinal Encoderdf_cat_onehot_encoded: df_cat encoded using OneHot EncoderNow, we can preparing the final training dataset by combining some of them.
# traning dataset using ordinal encoder
titanic_train_encoded = pd.concat([df_num_fill_median, df_cat_encoded], axis=1)
titanic_train_encoded.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 889 entries, 0 to 888 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Pclass 889 non-null float64 1 Age 889 non-null float64 2 SibSp 889 non-null float64 3 Parch 889 non-null float64 4 Fare 889 non-null float64 5 Sex 889 non-null float64 6 Embarked 889 non-null float64 dtypes: float64(7) memory usage: 48.7 KB
df_num_fill_median.head(5)
| Pclass | Age | SibSp | Parch | Fare | |
|---|---|---|---|---|---|
| 0 | 3.0 | 22.0 | 1.0 | 0.0 | 7.2500 |
| 1 | 1.0 | 38.0 | 1.0 | 0.0 | 71.2833 |
| 2 | 3.0 | 26.0 | 0.0 | 0.0 | 7.9250 |
| 3 | 1.0 | 35.0 | 1.0 | 0.0 | 53.1000 |
| 4 | 3.0 | 35.0 | 0.0 | 0.0 | 8.0500 |
df_cat_encoded.head(5)
| Sex | Embarked | |
|---|---|---|
| 0 | 1.0 | 2.0 |
| 1 | 0.0 | 0.0 |
| 2 | 0.0 | 2.0 |
| 3 | 0.0 | 2.0 |
| 4 | 1.0 | 2.0 |
df_cat_onehot_encoded.head(5)
| x0_female | x0_male | x1_C | x1_Q | x1_S | |
|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
# traning dataset using onehot encoder
titanic_train_onehot_encoded = pd.concat([df_num_fill_median, df_cat_onehot_encoded], axis=1)
titanic_train_onehot_encoded.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 889 entries, 0 to 888 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Pclass 889 non-null float64 1 Age 889 non-null float64 2 SibSp 889 non-null float64 3 Parch 889 non-null float64 4 Fare 889 non-null float64 5 x0_female 889 non-null float64 6 x0_male 889 non-null float64 7 x1_C 889 non-null float64 8 x1_Q 889 non-null float64 9 x1_S 889 non-null float64 dtypes: float64(10) memory usage: 69.6 KB
titanic_train_onehot_encoded.head(5)
| Pclass | Age | SibSp | Parch | Fare | x0_female | x0_male | x1_C | x1_Q | x1_S | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 22.0 | 1.0 | 0.0 | 7.2500 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 38.0 | 1.0 | 0.0 | 71.2833 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 3.0 | 26.0 | 0.0 | 0.0 | 7.9250 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 1.0 | 35.0 | 1.0 | 0.0 | 53.1000 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 3.0 | 35.0 | 0.0 | 0.0 | 8.0500 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
# train a DT model using titanic_train_encoded
from sklearn.tree import DecisionTreeClassifier
tree_clf1 = DecisionTreeClassifier(criterion='entropy', max_depth=3)
tree_clf1.fit(titanic_train_encoded, y)
DecisionTreeClassifier(criterion='entropy', max_depth=3)
# train another DT model using titanic_train_onehot_encoded
from sklearn.tree import DecisionTreeClassifier
tree_clf2 = DecisionTreeClassifier(criterion='entropy', max_depth=3)
tree_clf2.fit(titanic_train_onehot_encoded, y)
DecisionTreeClassifier(criterion='entropy', max_depth=3)
When we train tree_clf1/tree_clf2 above, titanic_train_encoded/titanic_train_onehot_encoded is the input dataframe we used.
When we make predictions, the input dataframe we feed into the model MUST match the structure (the number of columns with specific order) of titanic_train_encoded/titanic_train_onehot_encoded.
df_test = pd.read_csv('../data/titanic_test.csv')
df_test.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 418 entries, 0 to 417 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 418 non-null int64 1 Pclass 418 non-null int64 2 Name 418 non-null object 3 Sex 418 non-null object 4 Age 332 non-null float64 5 SibSp 418 non-null int64 6 Parch 418 non-null int64 7 Ticket 418 non-null object 8 Fare 417 non-null float64 9 Cabin 91 non-null object 10 Embarked 418 non-null object dtypes: float64(2), int64(4), object(5) memory usage: 36.0+ KB
# save PassengerId column for submission
df_test_id = df_test['PassengerId']
df_test_id.head()
0 892 1 893 2 894 3 895 4 896 Name: PassengerId, dtype: int64
# we dropped PassengerId, Name, Cabin, Ticket when training the model
# so we should drop those columns for prediction
df_test.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
df_test.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 418 entries, 0 to 417 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Pclass 418 non-null int64 1 Sex 418 non-null object 2 Age 332 non-null float64 3 SibSp 418 non-null int64 4 Parch 418 non-null int64 5 Fare 417 non-null float64 6 Embarked 418 non-null object dtypes: float64(2), int64(3), object(2) memory usage: 23.0+ KB
# we also need to handle missing values for the testing set
df_test.isnull().sum()
Pclass 0 Sex 0 Age 86 SibSp 0 Parch 0 Fare 1 Embarked 0 dtype: int64
# we use SimpleImputer to fill missing values in Age and Fare using median
median_imputer = SimpleImputer(strategy='median')
# select only numerical attributes for Imputer
df_test_num = df_test.select_dtypes(include=['int64','float64'])
# the following computes median for each attributes and store the result in statistics_ variable
df_test_num_fill_median = median_imputer.fit_transform(df_test_num)
# convert df_test_num_fill_mean to dataframe
df_test_num_fill_median = pd.DataFrame(df_test_num_fill_median, columns=df_test_num.columns)
df_test_num_fill_median.isnull().sum() # no missing values
Pclass 0 Age 0 SibSp 0 Parch 0 Fare 0 dtype: int64
# test numerical dtatframe shape is 418x5
df_test_num_fill_median.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 418 entries, 0 to 417 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Pclass 418 non-null float64 1 Age 418 non-null float64 2 SibSp 418 non-null float64 3 Parch 418 non-null float64 4 Fare 418 non-null float64 dtypes: float64(5) memory usage: 16.5 KB
# select only categorical attributes for Encoder
df_test_cat = df_test.select_dtypes(include=['object'])
df_test_cat.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 418 entries, 0 to 417 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Sex 418 non-null object 1 Embarked 418 non-null object dtypes: object(2) memory usage: 6.7+ KB
df_test_cat.head()
| Sex | Embarked | |
|---|---|---|
| 0 | male | Q |
| 1 | female | S |
| 2 | male | Q |
| 3 | male | S |
| 4 | female | S |
# You must encode the test categorical dataframe
# using the Ordinal Encoder we created for the training set
# DO NOT create a new encoder!!!
df_test_cat_encoded = cat_encoder.transform(df_test_cat)
# convert df_test_cat_encoded to dataframe
df_test_cat_encoded = pd.DataFrame(df_test_cat_encoded, columns=df_test_cat.columns)
df_test_cat_encoded.head()
| Sex | Embarked | |
|---|---|---|
| 0 | 1.0 | 1.0 |
| 1 | 0.0 | 2.0 |
| 2 | 1.0 | 1.0 |
| 3 | 1.0 | 2.0 |
| 4 | 0.0 | 2.0 |
# combine the dataframes as the test input dataframe
titanic_test_encoded = pd.concat([df_test_num_fill_median, df_test_cat_encoded], axis=1)
titanic_test_encoded.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 418 entries, 0 to 417 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Pclass 418 non-null float64 1 Age 418 non-null float64 2 SibSp 418 non-null float64 3 Parch 418 non-null float64 4 Fare 418 non-null float64 5 Sex 418 non-null float64 6 Embarked 418 non-null float64 dtypes: float64(7) memory usage: 23.0 KB
# the column array for the testing input dataframe using ordinal encoding
titanic_test_encoded.columns.values
array(['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex', 'Embarked'],
dtype=object)
# the column array for the training input dataframe using ordinal encoding
titanic_train_encoded.columns.values
array(['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex', 'Embarked'],
dtype=object)
titanic_test_encoded.head()
| Pclass | Age | SibSp | Parch | Fare | Sex | Embarked | |
|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 34.5 | 0.0 | 0.0 | 7.8292 | 1.0 | 1.0 |
| 1 | 3.0 | 47.0 | 1.0 | 0.0 | 7.0000 | 0.0 | 2.0 |
| 2 | 2.0 | 62.0 | 0.0 | 0.0 | 9.6875 | 1.0 | 1.0 |
| 3 | 3.0 | 27.0 | 0.0 | 0.0 | 8.6625 | 1.0 | 2.0 |
| 4 | 3.0 | 22.0 | 1.0 | 1.0 | 12.2875 | 0.0 | 2.0 |
# make prediction using the first tree
y_hat_ordinal = tree_clf1.predict(titanic_test_encoded)
y_hat_ordinal.shape
(418,)
# Note, the numbers here are float numbers with decimal, we need to convert them to integer
# if you don't do this, your score on Kaggle will be 0
y_hat_ordinal
array([0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 1., 0., 1., 1., 0.,
0., 1., 1., 0., 1., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0.,
0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0.,
0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 1., 0.,
0., 1., 1., 0., 1., 0., 1., 0., 0., 1., 0., 1., 1., 0., 0., 0., 0.,
0., 1., 1., 1., 1., 1., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 0.,
0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 1., 0.,
1., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
1., 0., 0., 1., 1., 0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 1., 1.,
0., 0., 0., 0., 0., 1., 1., 0., 1., 1., 0., 0., 1., 0., 1., 0., 1.,
0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 1., 1., 1., 0., 1.,
0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 1., 0., 1., 0., 1., 0., 1.,
0., 1., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
1., 1., 1., 1., 0., 0., 0., 0., 1., 0., 1., 1., 1., 0., 0., 0., 0.,
0., 0., 0., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 0., 0.,
1., 1., 0., 1., 0., 0., 0., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0.,
0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 1.,
0., 1., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 1., 0., 1., 0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 1.,
0., 0., 1., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0.,
1., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0.,
1., 1., 0., 0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0.,
1., 1., 1., 1., 1., 0., 1., 0., 0., 0.])
y_hat_ordinal.astype(int)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0])
# make the dataframe for submission by combining two columns
tree1_ordinal_submit = pd.DataFrame({
'PassengerId': df_test_id,
'Survived': y_hat_ordinal.astype(int),
})
tree1_ordinal_submit.head()
| PassengerId | Survived | |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
# save the resulting dataframe as a csv file for Kaggle submission
tree1_ordinal_submit.to_csv('../data/tree1_ordinal_submit.csv', index=False)
Next, you need to follow the same process and apply the second decision tree "tree_clf2" to the test set with the two categorical variables 'Sex' and 'Embarked' encoded by the 'onehot_encoder'. Name the output file "tree2_onhot_submit.csv"
df_test_cat.head
<bound method NDFrame.head of Sex Embarked 0 male Q 1 female S 2 male Q 3 male S 4 female S .. ... ... 413 male S 414 female C 415 male S 416 male S 417 male C [418 rows x 2 columns]>
df_test_cat_onthot_encoded = onehot_encoder.fit_transform(df_test_cat)
column_names = onehot_encoder.get_feature_names()
df_test_cat_onthot_encoded = df_test_cat_onthot_encoded.toarray()
df_test_cat_onthot_encoded = pd.DataFrame(df_test_cat_onthot_encoded, columns=column_names)
df_test_cat_onthot_encoded.head()
| x0_female | x0_male | x1_C | x1_Q | x1_S | |
|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
titanic_test_encoded = pd.concat([df_test_num_fill_median, df_test_cat_onthot_encoded], axis=1)
titanic_test_encoded.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 418 entries, 0 to 417 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Pclass 418 non-null float64 1 Age 418 non-null float64 2 SibSp 418 non-null float64 3 Parch 418 non-null float64 4 Fare 418 non-null float64 5 x0_female 418 non-null float64 6 x0_male 418 non-null float64 7 x1_C 418 non-null float64 8 x1_Q 418 non-null float64 9 x1_S 418 non-null float64 dtypes: float64(10) memory usage: 32.8 KB
titanic_test_encoded.columns.values
array(['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'x0_female', 'x0_male',
'x1_C', 'x1_Q', 'x1_S'], dtype=object)
titanic_train_encoded.columns.values
array(['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex', 'Embarked'],
dtype=object)
titanic_test_encoded.head()
| Pclass | Age | SibSp | Parch | Fare | x0_female | x0_male | x1_C | x1_Q | x1_S | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 34.5 | 0.0 | 0.0 | 7.8292 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 3.0 | 47.0 | 1.0 | 0.0 | 7.0000 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 2.0 | 62.0 | 0.0 | 0.0 | 9.6875 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 3 | 3.0 | 27.0 | 0.0 | 0.0 | 8.6625 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 4 | 3.0 | 22.0 | 1.0 | 1.0 | 12.2875 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
y_hat_onehot = tree_clf2.predict(titanic_test_encoded)
y_hat_onehot.shape
(418,)
y_hat_onehot
array([0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 1., 0., 1., 1., 0.,
0., 1., 1., 0., 1., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0.,
0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0.,
0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 1., 0.,
0., 1., 1., 0., 1., 0., 1., 0., 0., 1., 0., 1., 1., 0., 0., 0., 0.,
0., 1., 1., 1., 1., 1., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 0.,
0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 1., 0.,
1., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
1., 0., 0., 1., 1., 0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 1., 1.,
0., 0., 0., 0., 0., 1., 1., 0., 1., 1., 0., 0., 1., 0., 1., 0., 1.,
0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 1., 1., 1., 0., 1.,
0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 1., 0., 1., 0., 1., 0., 1.,
0., 1., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
1., 1., 1., 1., 0., 0., 0., 0., 1., 0., 1., 1., 1., 0., 0., 0., 0.,
0., 0., 0., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 0., 0.,
1., 1., 0., 1., 0., 0., 0., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0.,
0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 1.,
0., 1., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 1., 0., 1., 0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 1.,
0., 0., 1., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0.,
1., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0.,
1., 1., 0., 0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0.,
1., 1., 1., 1., 1., 0., 1., 0., 0., 0.])
y_hat_onehot.astype(int)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0])
tree2_onehot_submit = pd.DataFrame({
'PassengerId': df_test_id,
'Survived': y_hat_ordinal.astype(int),
})
tree2_onehot_submit.head()
| PassengerId | Survived | |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
tree2_onehot_submit.to_csv('../data/tree2_onehot_submit.csv', index=False)
Regression is one of the most useful and popular functions in data mining and statisitc learning.
Regression is aimed at building a model that can predict the value of y based on X.
There are several key concepts related to regression:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn')
import seaborn as sns
# let's generate some linear-looking data
# the underlying true model is y = 5 + 3x
# fix the random seed so that each run generates the same set of random numbers
np.random.seed(1)
# generate 100 random numbers between 0 and 2 with shape (100, 1)
X = 2 * np.random.rand(100, 1)
# generate 100 random numbers from a normal distribution
y = 5 + 3 * X + np.random.randn(100, 1)
fig, ax = plt.subplots()
ax.plot(X, y, ".")
ax.plot(X, 5+3*X)
[<matplotlib.lines.Line2D at 0x18789d188e0>]
# Use sklearn for linear regression
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X, y) # use the normal equation to train the model
LinearRegression()
# linear regression results using sklearn
# print out the intercept and coefficient(s)
print(lr.intercept_, lr.coef_)
[5.23695725] [[2.84246254]]
a0 = lr.intercept_[0]
a1 = lr.coef_[0,0]
(a0,a1)
(5.2369572541489084, 2.8424625438276605)
# Plot both the real model and the fitted model in the same graph
fig, ax = plt.subplots()
ax.plot(X, y, ".")
ax.plot(X, 5+3*X, label='real')
ax.plot(X, a0+a1*X, label='fitted')
fig.legend()
<matplotlib.legend.Legend at 0x18789d67820>
# making predictions using the model
X_new = np.array([[0.6], [0.9], [1.3]])
y_new_pred = lr.predict(X_new)
y_new_pred
array([[6.94243478],
[7.79517354],
[8.93215856]])
fig, ax = plt.subplots()
ax.plot(X, y, ".")
ax.plot(X, 5+3*X)
ax.plot(X_new, y_new_pred, 'o')
[<matplotlib.lines.Line2D at 0x18789e0b100>]
# Apply the model to X in the training set
y_pred = lr.predict(X)
# calculate MSE and RMSE
# NOTE: the RMSE is measured on the same scale with the same units as y.
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y, y_pred)
print(mse)
# Set "squared=False" to get RMSE
rmse = mean_squared_error(y, y_pred, squared=False)
print(rmse)
0.7997618656011584 0.8942940599160649
# R^2 of the model for the training set
lr.score(X, y)
0.7778975321559937
You can do linear regression using statsmodels package as follows, which gives you more information (such as R-squared and p-value) from a statistics perspective.
Checkout more at: https://dss.princeton.edu/online_help/analysis/interpreting_regression.htm
import statsmodels.api as sm
# if you want to see the p-value, etc. use the following code
X1 = sm.add_constant(X) # Need this line to add a constant (intercept) to the linear model
ols_reg = sm.OLS(y, X1)
res = ols_reg.fit()
print(res.summary())
# exactly same results as sklearn
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.778
Model: OLS Adj. R-squared: 0.776
Method: Least Squares F-statistic: 343.2
Date: Mon, 23 May 2022 Prob (F-statistic): 8.70e-34
Time: 15:07:39 Log-Likelihood: -130.72
No. Observations: 100 AIC: 265.4
Df Residuals: 98 BIC: 270.7
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.2370 0.174 30.041 0.000 4.891 5.583
x1 2.8425 0.153 18.527 0.000 2.538 3.147
==============================================================================
Omnibus: 2.308 Durbin-Watson: 2.206
Prob(Omnibus): 0.315 Jarque-Bera (JB): 1.753
Skew: -0.189 Prob(JB): 0.416
Kurtosis: 3.528 Cond. No. 3.61
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Polynomial Regression can fit non-linear data to a linear model by adding powers of each feature as new features and then train a linear model on the extended set of features.
# generate some non-linear data
np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 1 + 2* X + 3 * X**2 + np.random.randn(m, 1) # the "real" function is y = 1 + 2*x + 3*x^2
fig, ax = plt.subplots()
ax.plot(X, y, ".")
[<matplotlib.lines.Line2D at 0x1878ae55640>]
from sklearn.preprocessing import PolynomialFeatures
# a is a column vector representing two features x1, x2
a = np.array([[2, 3, 5], [7, 11, 13] ]).T
print(a)
# degree=2 added x1*x2, x1^2, x2^2
poly_features_2 = PolynomialFeatures(degree=2, include_bias=False)
b = poly_features_2.fit_transform(a)
print(b)
# degree=3 added x1*x2, x1^2, x2^2, x1^2*2, x2^2*x1, x1^3, x2^3
poly_features_3 = PolynomialFeatures(degree=3, include_bias=False)
c = poly_features_3.fit_transform(a)
print(c)
[[ 2 7] [ 3 11] [ 5 13]] [[ 2. 7. 4. 14. 49.] [ 3. 11. 9. 33. 121.] [ 5. 13. 25. 65. 169.]] [[2.000e+00 7.000e+00 4.000e+00 1.400e+01 4.900e+01 8.000e+00 2.800e+01 9.800e+01 3.430e+02] [3.000e+00 1.100e+01 9.000e+00 3.300e+01 1.210e+02 2.700e+01 9.900e+01 3.630e+02 1.331e+03] [5.000e+00 1.300e+01 2.500e+01 6.500e+01 1.690e+02 1.250e+02 3.250e+02 8.450e+02 2.197e+03]]
print(X[:3])
[[-0.75275929] [ 2.70428584] [ 1.39196365]]
# add polynomial features for X in the plot
X_poly = poly_features_2.fit_transform(X)
print(X_poly[:3]) # X, X^2
[[-0.75275929 0.56664654] [ 2.70428584 7.3131619 ] [ 1.39196365 1.93756281]]
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)
[0.78134581] [[1.93366893 3.06456263]]
a0 = lin_reg.intercept_[0]
a1 = lin_reg.coef_[0,0]
a2 = lin_reg.coef_[0,1]
print(a0,a1,a2)
0.7813458120291443 1.9336689322536071 3.0645626336170753
y_pred = a0 + a1*X + a2*X**2
# the "real" function is y = 1 + 2*x + 3*x^2
# the fitted model: y = 0.781 + 1.934*X + 3.065*X^2 - pretty close
fig, ax = plt.subplots()
ax.plot(X, y, ".")
ax.plot(X, y_pred, "r.")
[<matplotlib.lines.Line2D at 0x1878aec3190>]
# generate some non-linear data
np.random.seed(12)
m = 100
X = np.random.rand(m, 1)*2 # generate random numbers between 0 and 2.
y = np.exp(1 + 2*X) + np.random.randn(m, 1) # the "real" function is y = e^(1 + 2*x)
fig, ax = plt.subplots()
ax.plot(X, y, ".")
[<matplotlib.lines.Line2D at 0x1878af24940>]
from sklearn.preprocessing import FunctionTransformer
# Transform a variable x using function log(1 + x)
log_transformer = FunctionTransformer(np.log1p)
a = np.array([[0], [1], [4]])
a_new = log_transformer.transform(a)
print(a)
print(a_new)
[[0] [1] [4]] [[0. ] [0.69314718] [1.60943791]]
log_y = log_transformer.transform(y)
print(y[:5])
print(log_y[:5])
[[ 6.44392545] [50.78005844] [ 9.26450025] [24.62412344] [ 2.42007296]] [[2.00739832] [3.9470051 ] [2.32869136] [3.24353423] [1.22966188]]
fig, ax = plt.subplots()
ax.plot(X, log_y, ".")
[<matplotlib.lines.Line2D at 0x1878af968b0>]
lin_reg = LinearRegression()
lin_reg.fit(X, log_y)
print(lin_reg.intercept_, lin_reg.coef_)
[1.10810736] [[1.94320646]]
a0 = lin_reg.intercept_[0]
a1 = lin_reg.coef_[0,0]
(a0,a1)
(1.1081073623725604, 1.9432064635549418)
# predicted values of y.
# make sure to transform the value using function exp(x) - 1
y_pred = np.expm1(a0 + a1*X)
# the "real" function is y = e^(1 + 2*x)
# the fitted model: y = e^(a0 + a1*x)
fig, ax = plt.subplots()
ax.plot(X, y, ".")
ax.plot(X, y_pred, "r.")
[<matplotlib.lines.Line2D at 0x1878b004340>]
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn')
import seaborn as sns
diamonds = sns.load_dataset('diamonds')
diamonds.head()
| carat | cut | color | clarity | depth | table | price | x | y | z | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
diamonds.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 53940 entries, 0 to 53939 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 carat 53940 non-null float64 1 cut 53940 non-null category 2 color 53940 non-null category 3 clarity 53940 non-null category 4 depth 53940 non-null float64 5 table 53940 non-null float64 6 price 53940 non-null int64 7 x 53940 non-null float64 8 y 53940 non-null float64 9 z 53940 non-null float64 dtypes: category(3), float64(6), int64(1) memory usage: 3.0 MB
diamonds.describe(include='all')
| carat | cut | color | clarity | depth | table | price | x | y | z | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 53940.000000 | 53940 | 53940 | 53940 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 |
| unique | NaN | 5 | 7 | 8 | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | Ideal | G | SI1 | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | 21551 | 11292 | 13065 | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 0.797940 | NaN | NaN | NaN | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
| std | 0.474011 | NaN | NaN | NaN | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
| min | 0.200000 | NaN | NaN | NaN | 43.000000 | 43.000000 | 326.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.400000 | NaN | NaN | NaN | 61.000000 | 56.000000 | 950.000000 | 4.710000 | 4.720000 | 2.910000 |
| 50% | 0.700000 | NaN | NaN | NaN | 61.800000 | 57.000000 | 2401.000000 | 5.700000 | 5.710000 | 3.530000 |
| 75% | 1.040000 | NaN | NaN | NaN | 62.500000 | 59.000000 | 5324.250000 | 6.540000 | 6.540000 | 4.040000 |
| max | 5.010000 | NaN | NaN | NaN | 79.000000 | 95.000000 | 18823.000000 | 10.740000 | 58.900000 | 31.800000 |
In our exploratory data analysis (EDA), we’ve seen some surprising relationships between the quality of diamonds and their price: low quality diamonds (poor cuts, bad colours, and inferior clarity) have higher prices.
# Fair cuts have highter price?
sns.boxplot(data=diamonds, x='cut', y='price')
<AxesSubplot:xlabel='cut', ylabel='price'>
# The worst diamond color is J (slightly yellow)
sns.boxplot(data=diamonds, x='color', y='price')
<AxesSubplot:xlabel='color', ylabel='price'>
# The worst clarity is I1 (inclusions visible to the naked eye).
sns.boxplot(data=diamonds, x='clarity', y='price')
<AxesSubplot:xlabel='clarity', ylabel='price'>
Do these charts mean lower quality diamonds have higher prices? If that's the case, why do people pay higher prices for lower quality?
Do not forget there is an important confounding variable: the weight (carat) of the diamond. The weight of the diamond is the single most important factor for determining the price of the diamond, and lower quality diamonds tend to be larger.
diamonds.plot(x='carat', y='price', kind='scatter', alpha=0.5)
<AxesSubplot:xlabel='carat', ylabel='price'>
We build a simple regression model to predict diamond price by carat.
# Use sklearn for linear regression
from sklearn.linear_model import LinearRegression
diamonds_y = diamonds['price']
diamonds_y
0 326
1 326
2 327
3 334
4 335
...
53935 2757
53936 2757
53937 2757
53938 2757
53939 2757
Name: price, Length: 53940, dtype: int64
# X as a dataframe (numpy array)
diamonds_X = diamonds[['carat']]
diamonds_X
| carat | |
|---|---|
| 0 | 0.23 |
| 1 | 0.21 |
| 2 | 0.23 |
| 3 | 0.29 |
| 4 | 0.31 |
| ... | ... |
| 53935 | 0.72 |
| 53936 | 0.72 |
| 53937 | 0.70 |
| 53938 | 0.86 |
| 53939 | 0.75 |
53940 rows × 1 columns
lm1 = LinearRegression()
lm1.fit(diamonds_X, diamonds_y)
LinearRegression()
print(lm1.intercept_, lm1.coef_)
-2256.3605800454575 [7756.42561797]
# R^2 of the model
lm1.score(diamonds_X, diamonds_y)
0.8493305264354858
# Calculate the predicted price
diamonds_y_pred = lm1.predict(diamonds_X)
# LM1 P=-2256 + 7756 * Carat + e
# treugolCarat = 2 treugolP + 7756
diamonds.plot(x='carat', y='price', kind='scatter', alpha=0.5)
plt.scatter(diamonds['carat'], diamonds_y_pred, s=1, color='red')
<matplotlib.collections.PathCollection at 0x221a0a73640>
# RMSE: This indidcates how off the predicted values are from the actual values.
from sklearn.metrics import mean_squared_error
lm1_rmse = mean_squared_error(diamonds_y, diamonds_y_pred, squared=False)
lm1_rmse
1548.5331930613174
# How many diamonds are larger than 2.5 carats (99.7% of the data)
bigrock = diamonds.carat>2.5
bigrock.value_counts(normalize=True)
False 0.997664 True 0.002336 Name: carat, dtype: float64
# Focus on diamonds smaller than 2.5 carats.
df = diamonds[diamonds.carat<=2.5]
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 53814 entries, 0 to 53939 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 carat 53814 non-null float64 1 cut 53814 non-null category 2 color 53814 non-null category 3 clarity 53814 non-null category 4 depth 53814 non-null float64 5 table 53814 non-null float64 6 price 53814 non-null int64 7 x 53814 non-null float64 8 y 53814 non-null float64 9 z 53814 non-null float64 dtypes: category(3), float64(6), int64(1) memory usage: 3.4 MB
df = df.reset_index().drop(columns=['index'])
df
| carat | cut | color | clarity | depth | table | price | x | y | z | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53809 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
| 53810 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
| 53811 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
| 53812 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
| 53813 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |
53814 rows × 10 columns
df_X = df[['carat']]
df_y = df['price']
# Build a new lineary model using data without the outliers
lm2 = LinearRegression()
lm2.fit(df_X, df_y)
LinearRegression()
print(lm2.intercept_, lm2.coef_)
-2330.6555046788417 [7862.1680473]
# R^2 of the model
lm2.score(df_X, df_y)
0.8519722572415094
# Calculate the predicted price using the new model
y_pred = lm2.predict(df_X)
# Estimate the RMSE
mean_squared_error(df_y, y_pred, squared=False)
# 1 INCREASE FOR CUT INCREASE PRICE BY CARAT NUMBER
# LM2 P= -2330 + 7862*C
1520.698006853508
df.plot(x='carat', y='price', kind='scatter', alpha=0.5)
plt.scatter(df['carat'], y_pred, s=1, color='red')
<matplotlib.collections.PathCollection at 0x221a0afe5e0>
# Let's only keep carat and cut as independent variables.
# You can include more variables (e.g., color, clarity) later if you want.
df_X = df[['carat','cut']]
df_y = df['price']
df_X.head()
| carat | cut | |
|---|---|---|
| 0 | 0.23 | Ideal |
| 1 | 0.21 | Premium |
| 2 | 0.23 | Good |
| 3 | 0.29 | Premium |
| 4 | 0.31 | Good |
# 'cut' is a categorical variable with five possible values
df_X.cut.value_counts()
Ideal 21528 Premium 13745 Very Good 12063 Good 4889 Fair 1589 Name: cut, dtype: int64
# You could use sklearn's encoder such as OneHotEncoder to encode this variable.
# Alternatively, use get_dummies() to converting 'cut' to dummy variables, i.e., one-hot encoding
df_X_onehot = pd.get_dummies(df_X)
df_X_onehot.head()
| carat | cut_Ideal | cut_Premium | cut_Very Good | cut_Good | cut_Fair | |
|---|---|---|---|---|---|---|
| 0 | 0.23 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0.21 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0.23 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0.29 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0.31 | 0 | 0 | 0 | 1 | 0 |
# To differentiate the five levels of cut, we only need to keep four dummy variables
df_X1 = df_X_onehot[['carat','cut_Ideal','cut_Premium','cut_Very Good','cut_Good']]
# Build a new
lm_onhot = LinearRegression()
lm_onhot.fit(df_X1, df_y)
LinearRegression()
# Print the intercept and coefficients
print(lm_onhot.intercept_, lm_onhot.coef_)
-3895.040864200172 [7974.56625425 1751.30245509 1379.87435057 1452.57396734 1063.00747392]
How to interpret the coefficients?
carat: 7974.56625425 cut_Ideal: 1751.30245509 cut_Premium: 1379.87435057 cut_Very Good: 1452.57396734 cut_Good: 1063.00747392
# R^2
lm_onhot.score(df_X1, df_y)
0.8590074959441552
# RMSE
y_pred = lm_onhot.predict(df_X1)
mean_squared_error(df_y, y_pred, squared=False)
1484.1214098595588
With the knowledge that the five values of cut, ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'], indicate the quality from low to high, we could also encode this variable using an ordinal encoder (0, 1, 2, 3, 4).
# select the variable to be encoded
cat_df = df[['cut']]
cat_df.head()
| cut | |
|---|---|
| 0 | Ideal |
| 1 | Premium |
| 2 | Good |
| 3 | Premium |
| 4 | Good |
# OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder
# You can specify the order of values for endocing
ord_encoder = OrdinalEncoder(categories=[['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']])
cat_ord_encoded = ord_encoder.fit_transform(cat_df)
ord_encoder.categories_
[array(['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'], dtype=object)]
cat_ord_encoded
array([[4.],
[3.],
[1.],
...,
[2.],
[3.],
[4.]])
# convert cat_ord_encoded into dataframe
cat_df_ord_encoded = pd.DataFrame(cat_ord_encoded, columns=cat_df.columns)
cat_df_ord_encoded
| cut | |
|---|---|
| 0 | 4.0 |
| 1 | 3.0 |
| 2 | 1.0 |
| 3 | 3.0 |
| 4 | 1.0 |
| ... | ... |
| 53809 | 4.0 |
| 53810 | 1.0 |
| 53811 | 2.0 |
| 53812 | 3.0 |
| 53813 | 4.0 |
53814 rows × 1 columns
cat_df.cut.value_counts()
Ideal 21528 Premium 13745 Very Good 12063 Good 4889 Fair 1589 Name: cut, dtype: int64
cat_df_ord_encoded.cut.value_counts()
4.0 21528 3.0 13745 2.0 12063 1.0 4889 0.0 1589 Name: cut, dtype: int64
df_X_ord = pd.concat([df_X[['carat']], cat_df_ord_encoded], axis=1)
df_X_ord
| carat | cut | |
|---|---|---|
| 0 | 0.23 | 4.0 |
| 1 | 0.21 | 3.0 |
| 2 | 0.23 | 1.0 |
| 3 | 0.29 | 3.0 |
| 4 | 0.31 | 1.0 |
| ... | ... | ... |
| 53809 | 0.72 | 4.0 |
| 53810 | 0.72 | 1.0 |
| 53811 | 0.70 | 2.0 |
| 53812 | 0.86 | 3.0 |
| 53813 | 0.75 | 4.0 |
53814 rows × 2 columns
# Build a new model
lm_ord = LinearRegression()
lm_ord.fit(df_X_ord, df_y)
LinearRegression()
# Print the intercept and coefficients
print(lm_ord.intercept_, lm_ord.coef_)
-3139.8677974900493 [7943.41866224 256.31917809]
carat: 7943.41866224 cut: 256.31917809
# R^2
lm_ord.score(df_X_ord, df_y)
0.8571147927378382
# RMSE
df_y_pred = lm_ord.predict(df_X_ord)
mean_squared_error(df_y, df_y_pred, squared=False)
1494.0497284318856
The relationship between carat and price seems to be non-linear.
df.plot(x='carat', y='price', kind='scatter', alpha=0.5)
<AxesSubplot:xlabel='carat', ylabel='price'>
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p)
df['log_price'] = transformer.transform(df.price)
df['log_price']
0 5.789960
1 5.789960
2 5.793014
3 5.814131
4 5.817111
...
53809 7.922261
53810 7.922261
53811 7.922261
53812 7.922261
53813 7.922261
Name: log_price, Length: 53814, dtype: float64
df['log_carat'] = transformer.transform(df.carat)
df.plot(x='log_carat', y='log_price', kind='scatter', alpha=0.5)
<AxesSubplot:xlabel='log_carat', ylabel='log_price'>
df_X_log = df[['log_carat']]
df_X_log
| log_carat | |
|---|---|
| 0 | 0.207014 |
| 1 | 0.190620 |
| 2 | 0.207014 |
| 3 | 0.254642 |
| 4 | 0.270027 |
| ... | ... |
| 53809 | 0.542324 |
| 53810 | 0.542324 |
| 53811 | 0.530628 |
| 53812 | 0.620576 |
| 53813 | 0.559616 |
53814 rows × 1 columns
df_y_log = df['log_price']
df_y_log
0 5.789960
1 5.789960
2 5.793014
3 5.814131
4 5.817111
...
53809 7.922261
53810 7.922261
53811 7.922261
53812 7.922261
53813 7.922261
Name: log_price, Length: 53814, dtype: float64
lm_log = LinearRegression()
lm_log.fit(df_X_log, df_y_log)
LinearRegression()
# Print the intercept and coefficients
print(lm_log.intercept_, lm_log.coef_)
# LM.Log (log.P)= 5.57 + 3.99(Log.C)
5.57516028775912 [3.98906194]
# How to interpret the coefficients?
# LM.Log (log.P)= 5.57 + 3.99(Log.C)
# R^2
lm_log.score(df_X_log, df_y_log)
0.9125661358395424
# Make predictions using the new model
# using log_carat to predit log_price
log_y_pred = lm_log.predict(df_X_log)
# You must convert log price back to price
y_pred = np.expm1(log_y_pred)
# RMSE
mean_squared_error(df_y, y_pred, squared=False)
2480.1502949274745
df.plot(x='log_carat', y='log_price', kind='scatter', alpha=0.5)
plt.scatter(df['log_carat'], log_y_pred, s=1, color='red')
<matplotlib.collections.PathCollection at 0x221a07ef070>
df.plot(x='carat', y='price', kind='scatter', alpha=0.5)
plt.scatter(df['carat'], y_pred, s=1, color='red')
<matplotlib.collections.PathCollection at 0x221a07f3ca0>
Build an other regression model (e.g., including other variables such as color and clarity, adding categorical variable(s) to the log-transformed model). Test its performance.
df_X = df[['carat','color','clarity']]
df_y = df['price']
df_X_onehot = pd.get_dummies(df_X)
df_X_onehot.head()
| carat | color_D | color_E | color_F | color_G | color_H | color_I | color_J | clarity_IF | clarity_VVS1 | clarity_VVS2 | clarity_VS1 | clarity_VS2 | clarity_SI1 | clarity_SI2 | clarity_I1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0.21 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0.23 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0.29 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0.31 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
df_X1 = df_X_onehot[['carat','color_D','color_E','color_F','color_G','color_H','color_I','color_J','clarity_IF','clarity_VVS1','clarity_VVS2','clarity_VS1','clarity_VS2','clarity_SI1','clarity_SI2','clarity_I1']]
lm_onhot = LinearRegression()
lm_onhot.fit(df_X1, df_y)
print(lm_onhot.intercept_, lm_onhot.coef_)
-986150562889391.0 [ 8.95888576e+03 -6.34183753e+14 -6.34183753e+14 -6.34183753e+14 -6.34183753e+14 -6.34183753e+14 -6.34183753e+14 -6.34183753e+14 1.62033432e+15 1.62033432e+15 1.62033432e+15 1.62033432e+15 1.62033432e+15 1.62033432e+15 1.62033432e+15 1.62033432e+15]
lm_onhot.score(df_X1, df_y)
y_pred = lm_onhot.predict(df_X1)
mean_squared_error(df_y, y_pred, squared=False)
1141.7565879781826
df.plot(x='carat', y='price', kind='scatter', alpha=0.5)
plt.scatter(df['carat'], y_pred, s=1, color='red')
<matplotlib.collections.PathCollection at 0x2219b472e80>
cat_df = df[['color']]
from sklearn.preprocessing import OrdinalEncoder
ord_encoder = OrdinalEncoder(categories=[['D','E','F','G','H','I','J']])
cat_ord_encoded = ord_encoder.fit_transform(cat_df)
cat_df_ord_encoded = pd.DataFrame(cat_ord_encoded, columns=cat_df.columns)
df_X_ord = pd.concat([df_X[['carat']], cat_df_ord_encoded], axis=1)
lm_ord = LinearRegression()
lm_ord.fit(df_X_ord, df_y)
LinearRegression()
print(lm_ord.intercept_, lm_ord.coef_)
-1893.4388230847271 [8124.91192189 -249.27866003]
lm_ord.score(df_X_ord, df_y)
0.8625071024737584
df_y_pred = lm_ord.predict(df_X_ord)
mean_squared_error(df_y, df_y_pred, squared=False)
1465.5868193667095
df.plot(x='carat', y='price', kind='scatter', alpha=0.5)
plt.scatter(df['carat'], df_y_pred, s=1, color='red')
<matplotlib.collections.PathCollection at 0x2219f0d1c10>
cat_df = df[['clarity']]
from sklearn.preprocessing import OrdinalEncoder
ord_encoder = OrdinalEncoder(categories=[['IF','VVS1','VVS2','VS1','VS2','SI1','SI2','I1']])
cat_ord_encoded = ord_encoder.fit_transform(cat_df)
cat_df_ord_encoded = pd.DataFrame(cat_ord_encoded, columns=cat_df.columns)
df_X_ord = pd.concat([df_X[['carat']], cat_df_ord_encoded], axis=1)
lm_ord = LinearRegression()
lm_ord.fit(df_X_ord, df_y)
LinearRegression()
print(lm_ord.intercept_, lm_ord.coef_)
-864.2152779847929 [8472.19120163 -494.51685351]
lm_ord.score(df_X_ord, df_y)
0.889219860710435
df_y_pred = lm_ord.predict(df_X_ord)
mean_squared_error(df_y, df_y_pred, squared=False)
1315.5348932224024
df.plot(x='carat', y='price', kind='scatter', alpha=0.5)
plt.scatter(df['carat'], df_y_pred, s=1, color='red')
<matplotlib.collections.PathCollection at 0x2219b478b50>
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p)
df['log_price'] = transformer.transform(df.price)
df['log_depth'] = transformer.transform(df.depth)
df.plot(x='log_depth', y='log_price', kind='scatter', alpha=0.5)
df_X_log = df[['log_depth']]
df_y_log = df['log_price']
lm_log = LinearRegression()
lm_log.fit(df_X_log, df_y_log)
lm_log.score(df_X_log, df_y_log)
log_y_pred = lm_log.predict(df_X_log)
y_pred = np.expm1(log_y_pred)
mean_squared_error(df_y, y_pred, squared=False)
4230.0058942508
df.plot(x='log_depth', y='log_price', kind='scatter', alpha=0.5)
plt.scatter(df['log_depth'], log_y_pred, s=1, color='red')
df.plot(x='depth', y='price', kind='scatter', alpha=0.5)
plt.scatter(df['depth'], y_pred, s=1, color='red')
<matplotlib.collections.PathCollection at 0x221a083a9d0>
In this course, we will use the Python programming language for all tutorials, exercises, and assignments.
Python is a great general-purpose programming language. With the help of several popular libraries (e.g., numpy, scipy, pandas, matplotlib, sklearn), it provides a powerful environment for data analytics and computing.
It'd be great if you have some experience with Python and numpy. If not, we will take this notebook as a crash course on basics for Python programming and its use for scientific computing.
Many say that Python code is like pseudocode for it can express very powerful ideas in very few lines of code while being very readable.
As an example, here is an implementation of the classic quicksort algorithm in Python:
# Define a fuction named "quicksort" that takes an array "arr" as input
def quicksort(arr):
# check if the array is empty
if len(arr) <= 1:
return arr
# find the element in the middle of the of array
pivot = arr[len(arr) // 2]
# save all elements less than the pivot in an list called "left"
left = [x for x in arr if x < pivot]
# save all elements equal to the pivot in an list called "middle"
middle = [x for x in arr if x == pivot]
# save all elements greater than the pivot in an list called "right"
right = [x for x in arr if x > pivot]
# combine left + middle + right together
# notice that this quicksort function is recurisive: it calls itself.
return quicksort(left) + middle + quicksort(right)
'''
You can call this function to sort a list of numbers.
You can also use this function to sort a list of strings.
BTW. Here I show another way of commenting multiple lines in Python code.
'''
print(quicksort([3,8,16,9,14,8,10,1,2,19]))
print(quicksort(['Google','Apple','Microsoft','Amazon']))
[1, 2, 3, 8, 8, 9, 10, 14, 16, 19] ['Amazon', 'Apple', 'Google', 'Microsoft']
Like most languages, Python has a number of basic types including integers, floats, booleans, and strings.
Numbers: Integers and floats work as you would expect from other languages:
x = 3
print(type(x)) # Prints "<class 'int'>"
print(x)
print(x + 1) # Addition
print(x - 1) # Subtraction
print(x * 2) # Multiplication
print(x ** 2) # Exponentiation
x = x + 1 # Assign the value (x+1) to x
print(x)
x += 1 # This is simpler.
print(x)
x *= 2 # Assign the vlaue (x*2) to x
print(x)
y = 2.5
print(type(y)) # Prints "<class 'float'>"
print(y, y + 1, y * 2, y ** 2) # Prints multiple items together
<class 'int'> 3 4 2 6 9 4 5 10 <class 'float'> 2.5 3.5 5.0 6.25
Booleans: Python implements all of the usual operators for Boolean logic:
t = True
f = False
print(type(t)) # Prints "<class 'bool'>"
print(t and f) # Logical AND; prints "False"
print(t or f) # Logical OR; prints "True"
print(not t) # Logical NOT; prints "False"
x = x - 9
if x > 3:
print(str(x) + ' is greater than 3.')
<class 'bool'> False True False
x=2
print(x)
x=x+1
if x >=5:
print(str(x)+' is greater than 3.')
else:
print(str(x)+' is less than 5.')
2 3 is less than 5.
Strings: Python has great support for strings:
h = 'hello' # String literals can use single quotes
w = "world" # or double quotes; it does not matter.
print(h)
print(len(h)) # len() returns the string length
hw = h + ' ' + w # String concatenation
print(hw)
hwy = '%s %s %d' % (h, w, 2021) # string formatting
print(hwy)
hello 5 hello world hello world 2021
String objects have a bunch of useful methods:
s = "hello"
print(s.capitalize()) # Capitalize a string; prints "Hello"
print(s.upper()) # Convert a string to uppercase; prints "HELLO"
print('Hello'.lower()) # Convert a string to lowercase; prints "hello"
print(s.replace('l', '(ell)')) # Replace all instances of one substring with another;
print(' world '.strip()) # Strip leading and trailing whitespace; prints "world"
Hello HELLO hello he(ell)(ell)o world
# More functions/operaitons for strings
chords = 'C G Am Em F C Dm G'
print(chords.split()) # Split a string to a list of strings, default by ' '
print(chords.replace(' ','-').split('-')) # Split a string by '-'
['C', 'G', 'Am', 'Em', 'F', 'C', 'Dm', 'G'] ['C', 'G', 'Am', 'Em', 'F', 'C', 'Dm', 'G']
How to extract a substring from a string? You can use the following templates:
string[start:end]: Get all characters from index start to end-1string[:end]: Get all characters from the beginning of the string to end-1string[start:]: Get all characters from index start to the end of the stringstring[start:end:step]: Get all characters from start to end-1 discounting every step characterprint(len(chords)) # The length of a string
# substrings
print(chords[0]) # The first character
print(chords[0:3]) # The first three characters
print(chords[:3]) # The first three characters
print(chords[5:10]) # The characters from index 5 to 9
print(chords[-4]) # The 4th characters from the end
print(chords[-4:]) # The last four characters
18 C C G C G m Em D Dm G
Count the occurrence of a character in a string.
For example, count the occurrence of spaces (' ') in the string.
# Using a loop
count = 0
for i in range(len(chords)):
if chords[i] == ' ':
count += 1
count
7
# You can also iterate all characters in string like this:
count = 0
for ch in chords:
if ch == ' ':
count += 1
count
7
# Using count()
chords.count(' ')
7
See https://realpython.com/python-lists-tuples/
A Python list is a collection of arbitrary objects, similar to an array in many other programming languages.
# one dimensional list/array
a = ['apple', 'orange', 'kiwi', 'grape', 'cherry']
print(a[4])
print(a[1:3])
print(a[:3])
print(a[2:])
print(a[:-2])
print(a[1:-2])
cherry ['orange', 'kiwi'] ['apple', 'orange', 'kiwi'] ['kiwi', 'grape', 'cherry'] ['apple', 'orange', 'kiwi'] ['orange', 'kiwi']
# nested list/two dimentional array
b = [['apple', 2], ['orange', 5], ['kiwi', 4], ['grape', 3], ['cherry', 25]]
# show the quantity of kiwi
print(b[2][1])
4
# make sure you know the following list methods
c = a.append('peach')
print(c) # NOTE that append() change a list in place and does not return a new list
print(a)
a.remove('kiwi')
print(a)
None ['apple', 'orange', 'kiwi', 'grape', 'cherry', 'peach'] ['apple', 'orange', 'grape', 'cherry', 'peach']
List comprehension formula:
new_list = [expression (if conditional for changing the value) for member in iterable (if conditional for filtering the value)]
d = [5, -2, 7, 3, -4, 10]
# create a list by replacing each number that is smaller than 9 with 'p' if positive, and 'n' if negative
d_new = ['p' if i > 0 else 'n' for i in d if i <9]
print(d_new)
['p', 'n', 'p', 'p', 'n']
d1 = []
for i in d:
if i<9:
if i>0:
d1.append('p')
else:
d1.append('n')
d1
['p', 'n', 'p', 'p', 'n']
# iterate over a list via list comprehension
[print(i) for i in d]
5 -2 7 3 -4 10
[None, None, None, None, None, None]
# another way
for i in d:
print(i)
5 -2 7 3 -4 10
# check out f strings: https://realpython.com/python-f-strings/
# if you want to get the index
for i in range(len(d)):
print(f'the number at index of {i} is {d[i]}')
the number at index of 0 is 5 the number at index of 1 is -2 the number at index of 2 is 7 the number at index of 3 is 3 the number at index of 4 is -4 the number at index of 5 is 10
A dictionary stores (key, value) pairs.
d = {'cat': 'cute', 'dog': 'furry'} # Create a new dictionary with some data
print(d['cat']) # Get an entry from a dictionary; prints "cute"
print('cat' in d) # Check if a dictionary has a given key; prints "True"
d['fish'] = 'wet' # Set an entry in a dictionary
print(d['fish']) # Prints "wet"
# print(d['monkey']) # KeyError: 'monkey' not a key of d
print(d.get('monkey', 'N/A')) # Get an element with a default; prints "N/A"
print(d.get('fish', 'N/A')) # Get an element with a default; prints "wet"
del d['fish'] # Remove an element from a dictionary
print(d.get('fish', 'N/A')) # "fish" is no longer a key; prints "N/A"
cute True wet N/A wet N/A
It is easy to iterate over the keys in a dictionary:
d = {'person': 2, 'cat': 4, 'spider': 8}
for animal in d:
legs = d[animal]
print('A %s has %d legs.' % (animal, legs))
# Prints "A person has 2 legs", "A cat has 4 legs", "A spider has 8 legs"
A person has 2 legs. A cat has 4 legs. A spider has 8 legs.
Unlike a list, a set is an unordered collection of distinct elements.
animals = {'cat', 'dog'}
print('cat' in animals) # Check if an element is in a set; prints "True"
print('fish' in animals) # prints "False"
animals.add('fish') # Add an element to a set
print('fish' in animals) # Prints "True"
print(len(animals)) # Number of elements in a set; prints "3"
animals.add('cat') # Adding an element that is already in the set does nothing
print(len(animals)) # Prints "3"
animals.remove('cat') # Remove an element from a set
print(len(animals)) # Prints "2"
True False True 3 3 2
A tuple is an (immutable) ordered list of values. A tuple is in many ways similar to a list; one of the most important differences is that tuples can be used as keys in dictionaries and as elements of sets, while lists cannot.
t = ('foo', 'bar', 'baz', 'qux', 'quux', 'corge')
print(t[0])
print(t[-1])
foo corge
Tuples can be used as keys in dictionaries and as elements of sets.
d = {(x, x + 1): x for x in range(10)} # Create a dictionary with tuple keys
print(d)
t = (5, 6) # Create a tuple
print(type(t)) # Prints "<class 'tuple'>"
print(d[t]) # Given a key (5, 6), find the corresponding value in dictionary d.
{(0, 1): 0, (1, 2): 1, (2, 3): 2, (3, 4): 3, (4, 5): 4, (5, 6): 5, (6, 7): 6, (7, 8): 7, (8, 9): 8, (9, 10): 9}
<class 'tuple'>
5
Referece Chapter: https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html
import numpy as np
np.__version__
'1.20.3'
Creating Arrays from Python Lists
np.array([1, 4, 2, 5, 3])
array([1, 4, 2, 5, 3])
All elements in a NumPy array must be of the same type. If types do not match, NumPy will upcast if possible: integers are up-cast to floating point:
# up-cast
np.array([3.14, 4, 2, 3])
array([3.14, 4. , 2. , 3. ])
# explicit data type
np.array([1, 2, 3, 4], dtype='float32')
array([1., 2., 3., 4.], dtype=float32)
# nested lists result in multi-dimensional arrays
np.array([[2, 3, 4],
[4, 5, 6],
[6, 7, 8]])
array([[2, 3, 4],
[4, 5, 6],
[6, 7, 8]])
np.array([range(i, i + 3) for i in [2, 4, 6]])
array([[2, 3, 4],
[4, 5, 6],
[6, 7, 8]])
Creating Arrays from Scratch
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)
array([[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.]])
# Create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)
array([[3.14, 3.14, 3.14, 3.14, 3.14],
[3.14, 3.14, 3.14, 3.14, 3.14],
[3.14, 3.14, 3.14, 3.14, 3.14]])
# Create an array filled with a linear sequence
# Starting at 0 (inclusive), ending at 20 (exlusive), stepping by 2
# (this is similar to the built-in range() function)
np.arange(0, 20, 2)
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
# Create an array of five values evenly spaced between 0 and 10
np.linspace(0, 10, 5)
array([ 0. , 2.5, 5. , 7.5, 10. ])
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3, 3))
array([[0.15585347, 0.57922258, 0.50938664],
[0.85283522, 0.60192893, 0.45325918],
[0.44066822, 0.19242312, 0.71168268]])
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))
array([[ 1.16787386, -1.29381827, -0.93148773],
[-0.4941656 , 0.72399706, 1.04018595],
[ 0.18327201, -0.51223134, 0.55574386]])
# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))
array([[5, 9, 8],
[0, 3, 2],
[0, 6, 4]])
# Create a 3x3 identity matrix
np.eye(3)
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
np.random.seed(42) # seed for reproducibility
x1 = np.random.randint(10, size=6) # One-dimensional array
x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array: 3 rows x 4 columns
x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array: 3 4 rows x 5 columns two-dimentional array
print(x1)
print("********")
print(x2)
print("********")
print(x3)
[6 3 7 4 6 9] ******** [[2 6 7 4] [3 7 7 2] [5 4 1 7]] ******** [[[5 1 4 0 9] [5 8 0 9 2] [6 3 8 2 4] [2 6 4 8 6]] [[1 3 8 1 9] [8 9 4 1 3] [6 7 2 0 3] [1 7 3 1 5]] [[5 9 3 5 1] [9 1 9 3 7] [6 8 7 4 1] [4 7 9 8 8]]]
print(f'x3 ndim: {x3.ndim}')
print(f'x3 shape: {x3.shape}')
print(f'x3 size: {x3.size}')
x3 ndim: 3 x3 shape: (3, 4, 5) x3 size: 60
print("dtype:", x3.dtype)
dtype: int32
To access a slice of an array x, use this:
x[start:stop:step]
The slice extends from the ‘start’ index and ends one item before the ‘stop’ index.
If any of these are unspecified, they default to the values start=0, stop=size of dimension, step=1.
# you need to understand the following
print(x1[4])
print(x2[2, 0])
print(x2[2, :])
print(x1[::-1]) # all elements reversed
6 5 [5 4 1 7] [9 6 4 7 3 6]
print(x3[2]) # The element of index=2 in x3 which is a 4x5 array; same as x3[2,:]
[[5 9 3 5 1] [9 1 9 3 7] [6 8 7 4 1] [4 7 9 8 8]]
print(x3[2, 2, 4]) # third 4x5 array, row 3 and column 5, which gives you 5
print(x3[2, 1:4, 4]) # think what this is trying to do
1 [7 1 8]
This default behavior of Numpy is very useful: when we handle large datasets, this allows us to access and process part of the datasset without making copies of the underlying data (could be slow and costly).
# slices of a list are copies of the list
# changing slices does not change the list
a = [1, 2, 3, 4, 5]
b = a[2:4]
print(b)
b[1] = 9
print(b)
print(a)
[3, 4] [3, 9] [1, 2, 3, 4, 5]
# slices of a Numpy array are views of the array
# changing the slices will change the original array!!!
c = np.array([1, 2, 3, 4, 5])
#d = c[2] # note c[2] is not a slice of c, if you want to have one element as a slice use c[2:3]
d = c[2:4]
print(d)
d[1] = 9
print(d)
print(c)
[3 4] [3 9] [1 2 3 9 5]
If you want to make a copy of the slice, you have to use the copy() method:
# c won't change
e = c[2:4].copy()
print(e)
e[1] = 8
print(e)
print(c)
[3 9] [3 8] [1 2 3 9 5]
sort(), argsort()reshape()concatenate()split(), vsplit(): split arrays vertically or horizontallyvstack(), hstack(): stacking arrays vertically or horizontally# sorting
a = np.array([2, 5, 4, 8, 3])
b = np.sort(a)
print(f'b is {b}')
print(f'a is not changed: {a}')
c = np.argsort(a) # argsort() returns the indexes of the sorted array
print(f'c is the indexes of the sorted a: {c}')
a.sort() # sorting in place, a is changed
print(f'a has been changed: {a}')
b is [2 3 4 5 8] a is not changed: [2 5 4 8 3] c is the indexes of the sorted a: [0 4 2 1 3] a has been changed: [2 3 4 5 8]
# reshaping
a = np.arange(1, 10)
print(f'a is a simple one dimensional array: {a}')
grid = a.reshape((3, 3))
print(f'a becomes a 3x3 grid after reshaping:')
print(grid)
a is a simple one dimensional array: [1 2 3 4 5 6 7 8 9] a becomes a 3x3 grid after reshaping: [[1 2 3] [4 5 6] [7 8 9]]
# Concatenation of arrays
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
print(np.concatenate([x, y]))
# + for numpy is sum, not concatenation
z = x + y
print(z)
# list is different, + is concatenation
a = [1, 2]
b = [3, 4]
print(a + b)
[1 2 3 3 2 1] [4 4 4] [1, 2, 3, 4]
axis 0 means row (default), axis 1 means column
grid = np.array([[1, 2, 3],
[4, 5, 6]])
# concatenate along the first axis (row), default
np.concatenate([grid, grid], axis=0)
array([[1, 2, 3],
[4, 5, 6],
[1, 2, 3],
[4, 5, 6]])
# concatenate along the second axis (column) (zero-indexed)
np.concatenate([grid, grid], axis=1)
array([[1, 2, 3, 1, 2, 3],
[4, 5, 6, 4, 5, 6]])
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
[6, 5, 4]])
# vertically stack the arrays
np.vstack([x, grid])
array([[1, 2, 3],
[9, 8, 7],
[6, 5, 4]])
# horizontally stack the arrays
y = np.array([[99],
[99]])
np.hstack([grid, y])
array([[ 9, 8, 7, 99],
[ 6, 5, 4, 99]])
# splitting arrays
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2 = np.split(x, [3])
print(x1, x2)
[1 2 3] [99 99 3 2 1]
# splitting arrays
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 6])
print(x1, x2, x3)
[1 2 3] [99 99 3] [2 1]
# split vertically
grid = np.arange(16).reshape((4, 4))
grid
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)
[[0 1 2 3] [4 5 6 7]] [[ 8 9 10 11] [12 13 14 15]]
x = np.arange(5)
print("x =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2) # floor division
x = [0 1 2 3 4] x + 5 = [5 6 7 8 9] x - 5 = [-5 -4 -3 -2 -1] x * 2 = [0 2 4 6 8] x / 2 = [0. 0.5 1. 1.5 2. ] x // 2 = [0 0 1 1 2]
print("-x = ", -x)
print("x ** 2 = ", x ** 2)
print("x % 2 = ", x % 2)
-x = [ 0 -1 -2 -3 -4] x ** 2 = [ 0 1 4 9 16] x % 2 = [0 1 0 1 0]
These arithmetic operations are convenient wrappers around specific functions built into NumPy. The following table lists the arithmetic operators implemented in NumPy:
+: np.add Addition -: np.subtract Subtraction -: np.negative Unary negation *: np.multiply Multiplication /: np.divide Division//: np.floor_divide Floor division **: np.power Exponentiation %: np.mod Modulus/remainder np.add(x,2)
array([2, 3, 4, 5, 6])
# Absolute value
x = np.array([-2, -1, 0, 1, 2])
abs(x)
array([2, 1, 0, 1, 2])
np.absolute(x)
array([2, 1, 0, 1, 2])
np.abs(x)
array([2, 1, 0, 1, 2])
# Exponents
x = [1, 2, 3, 4, 5]
print("x =", x)
print("e^x =", np.exp(x))
print("2^x =", np.exp2(x))
print("3^x =", np.power(3, x))
x = [1, 2, 3, 4, 5] e^x = [ 2.71828183 7.3890561 20.08553692 54.59815003 148.4131591 ] 2^x = [ 2. 4. 8. 16. 32.] 3^x = [ 3 9 27 81 243]
# Logorithms
x = [1, 2, 4, 10, 100]
print("x =", x)
print("ln(x) =", np.log(x))
print("log2(x) =", np.log2(x))
print("log10(x) =", np.log10(x))
x = [1, 2, 4, 10, 100] ln(x) = [0. 0.69314718 1.38629436 2.30258509 4.60517019] log2(x) = [0. 1. 2. 3.32192809 6.64385619] log10(x) = [0. 0.30103 0.60205999 1. 2. ]
Aggregates available in NumPy can be extremely useful for summarizing a set of values.
As a simple example, let's consider the heights (cm) of US presidents.
heights_cm = [189, 170, 189, 163, 183, 171, 185, 168, 173, 183, 173, 173, 175, 178, 183, 193, 178, 173,
174, 183, 183, 168, 170, 178, 182, 180, 183, 178, 182, 188, 175, 179, 183, 193, 182, 183,
177, 185, 188, 188, 182, 185, 190, 183]
heights = np.array(heights_cm)
print(heights)
[189 170 189 163 183 171 185 168 173 183 173 173 175 178 183 193 178 173 174 183 183 168 170 178 182 180 183 178 182 188 175 179 183 193 182 183 177 185 188 188 182 185 190 183]
print("Mean height: ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height: ", heights.min())
print("Maximum height: ", heights.max())
Mean height: 180.04545454545453 Standard deviation: 6.957515705579717 Minimum height: 163 Maximum height: 193
print("25th percentile: ", np.percentile(heights, 25))
print("Median: ", np.median(heights))
print("75th percentile: ", np.percentile(heights, 75))
25th percentile: 174.75 Median: 182.0 75th percentile: 183.5
We assume you are using Python 3 in this course.
You've learned different data types in Python. Now, let's test your knowledge.
You need to use to format strings.
Write a program using the "f-strings" (https://realpython.com/python-f-strings/) and user input function to convert temperatures from Fahrenheit to Celsius. [Formula: Celsius = (Fahrenheit – 32)*5/9]
Hint: you may need int() or float() function
An example program output:
Please enter the temperature in Celcius: 140
140F is 60.0 in Celsius.
# an example of f-strings and user input function
username = input('What\'s your name? ')
print(f'Welcome, {username}!')
What's your name? N Welcome, N!
# complete your program here
f = input('Please enter the temperature in Fahrenheit: ')
f= int(f)
c = (f - 32)*5/9
print(f'{f}F is {c} in Celsius.')
Please enter the temperature in Fahrenheit: 140 140F is 60.0 in Celsius.
Now, do the oppsoite by convert temperatures from Celsius to Fahrenheit. Another example:
Please enter the temperature in Celcius: 60
60C is 140.0 in Fahrenheit.
# complete your program here
c = input('Please enter the temperature in Celcius: ')
c= int(c)
f = c*9/5+32
print(f'{c}C is {f} in Fahrenheit.')
Please enter the temperature in Celcius: 60 60C is 140.0 in Fahrenheit.
Next, let's make it a bit challenging by adding conditional statements to convert temperatures to and from Celsius, Fahrenheit. [Formula: Celsius/5 = (Fahrenheit – 32)/9]
An example program output:
Please enter the temperature: 60
Is this in Celsius or Fahrenheit? C
60C is 140 in Fahrenheit
Another example:
Please enter the temperature: 140
Is this Celsius or Fahrenheit? F
140F is 60 in Celsius
# complete your program here
d = input('Please enter the temperature: ')
m = input('Is this in Celsius or Fahrenheit? ')
C = True
F = False
if m=='C':
d= int(d)
f = d*9/5+32
print(f'{d}C is {f} in Fahrenheit.')
else:
d= int(d)
c = (d - 32)*5/9
print(f'{d}F is {c} in Celsius.')
Please enter the temperature: 60 Is this in Celsius or Fahrenheit? C 60C is 140.0 in Fahrenheit.
Have some fun with strings.
sentence = "Talk is cheap. Show me the code - Linus Torvalds"
# Find the length of this sentence.
print(len(sentence))
48
# Convert the whole string to lower case.
print(sentence.lower())
talk is cheap. show me the code - linus torvalds
# Extract a substring from the 10th character from left to the 10th character from right.
print(sentence[9:-9])
heap. Show me the code - Linus
# Split the sentence into a list of substrings by space ' '.
sentence.split()
['Talk', 'is', 'cheap.', 'Show', 'me', 'the', 'code', '-', 'Linus', 'Torvalds']
# Given the sentence, output a list of words in lowercase (without punctuations).
sentence.lower().replace('.','').replace('-','')
'talk is cheap show me the code linus torvalds'
# Count the total number of vowels in the sentence.
# a e i o u
count=0
for i in range(len(sentence)):
if sentence[i] == 'a':
count +=1
if sentence[i] == 'e':
count +=1
if sentence[i] == 'i':
count +=1
if sentence[i] == 'o':
count +=1
if sentence[i] == 'u':
count +=1
count
13
# Find the positions (indexes) of all vowels in the sentence.
for i in range(len(sentence)):
if sentence[i] =='a':
print(i)
if sentence[i] =='e':
print(i)
if sentence[i] =='i':
print(i)
if sentence[i] =='o':
print(i)
if sentence[i] =='u':
print(i)
1 5 10 11 17 21 25 28 30 35 37 41 44
Review list comprehension if needed: https://realpython.com/list-comprehension-python/
The formula for list comprehension is: new_list = [expression for member in iterable (if conditional)]
You need to do the following:
# Use a loop to create a list of 5 cube numbers and print the list:[0, 1, 8, 27, 64]
cubes = []
for i in range(5):
cubes.append(i*i*i)
cubes
[0, 1, 8, 27, 64]
# Use list comprehension to create the same list
[print(i*i*i) for i in range(5)]
0 1 8 27 64
[None, None, None, None, None]
# Find the postions (indexes) of all vowels in a sentence using list comprehension
sentence = "Toto, I\'ve a feeling we\'re not in Kansas anymore. - THE WIZARD OF OZ (1939)"
sen = []
for i in range(len(sentence)):
if sentence[i] == 'a':
sen.append(i)
if sentence[i] == 'A':
sen.append(i)
if sentence[i] == 'e':
sen.append(i)
if sentence[i] == 'E':
sen.append(i)
if sentence[i] == 'o':
sen.append(i)
if sentence[i] == 'O':
sen.append(i)
if sentence[i] == 'i':
sen.append(i)
if sentence[i] == 'I':
sen.append(i)
if sentence[i] == 'y':
sen.append(i)
sen
[1, 3, 6, 9, 11, 14, 15, 17, 22, 25, 28, 31, 35, 38, 41, 43, 45, 47, 54, 57, 59, 63, 66]
# Use a loop to create a list of first n numbers in the Fibonacci sequence
# [0, 1, 1, 2, 3, 5, 8, ...]
fibonacci = []
n = input('How many numbers do you want? ')
# add your code here
n=int(n)
if n==1:
fibonacci.append(0)
if n==2:
fibonacci = [0,1]
if n>=3:
fibonacci = [0,1]
a1 =0
a2= 1
for i in range(n-2):
a3=a1+a2
fibonacci.append(a3)
a1=a2
a2=a3
fibonacci
How many numbers do you want? 6
[0, 1, 1, 2, 3, 5]
This question is about Python dictionary.
| Vaccine | Efficacy |
|---|---|
| Pfizer | 95% |
| Moderna | 95% |
| AstraZeneca | 72% |
| Johnson & Johnson | 66% |
# Create a dictionary to save the data of vaccines and their efficacy. (save 95% as 0.95)
v_dict = {'Pfizer':0.95, 'Moderna':0.95, 'AstraZeneca':0.72, 'Johnson & Johnson': 0.66}
v = input('Which vaccine? ')
# print the efficacy of the vaccine
if v == 'Pfizer':
print(v_dict['Pfizer'])
if v == 'Moderna':
print(v_dict['Moderna'])
if v == 'AstraZeneca':
print(v_dict['AstraZeneca'])
if v == 'Johnson & Johnson':
print(v_dict['Johnson & Johnson'])
Which vaccine? Pfizer 0.95
# add another vaccine 'Sputnik V' with 91.4% efficacy rate to the dictionary
v_dict['Sputnik V']='0.914'
# add another vaccine 'Sinovac' with 50.38% efficacy rate to the dictionary
v_dict['Sinovac']='0.5038'
# Print vaccines with efficacy lower than 90%.
for vaccine in v_dict:
efficacy = v_dict[vaccine]
efficacy = float(efficacy)
if efficacy <= 0.9:
print(vaccine)
AstraZeneca Johnson & Johnson Sinovac
# delete the Sinovac vaccine from the dictionary.
del v_dict['Sinovac']
#Count the occurrence of each alphabet (no space ' ') in this sentence.
sentence = 'the quick brown fox jumps over the lazy dog'
# Save the result in a dictionary
ch_dict = {}
# Use a loop to iterate all characters
# add your code here
count = 0
for ch in sentence:
if ch in ch_dict:
sentence = sentence.replace(' ','')
else:
count = count + 1 #add a new entry ch_dict[ch] = 1
ch_dict[ch] = count # ch_dict[ch] increase by 1
# print the dictionary
print(ch_dict)
#{'a':!,'b':?,....}
{'t': 1, 'h': 2, 'e': 3, ' ': 4, 'q': 5, 'u': 6, 'i': 7, 'c': 8, 'k': 9, 'b': 10, 'r': 11, 'o': 12, 'w': 13, 'n': 14, 'f': 15, 'x': 16, 'j': 17, 'm': 18, 'p': 19, 's': 20, 'v': 21, 'l': 22, 'a': 23, 'z': 24, 'y': 25, 'd': 26, 'g': 27}
Next, we will practice with Numpy arrays.
import numpy as np
# create a 1-D array of first 20 positive even numbers (starting with 2)
np.arange(2,41,2)
array([ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34,
36, 38, 40])
# create a 1-D array of first 20 positive odd numbers (starting with 1)
np.arange(1,40,2)
array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33,
35, 37, 39])
# create a 10x10 array in the following format:
# first row: [0,1,...,9]
# second row: [10,11,...,19]
# ...
# tenth row: [90,91,...,99]
np.array([range(i,i+10) for i in [0,10,20,30,40,50,60,70,80,90]])
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
[50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
[80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
[90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])
# Create a 10x10 array of normally distributed random values
# with mean 0 and standard deviation 1
M=np.random.normal(0,1,(10,10))
M
array([[-0.56333266, 0.40553282, -2.05370671, 1.59359031, -0.72439689,
2.60265313, 0.17514359, -0.50682756, -0.14476835, 0.53481676],
[ 0.46396734, 0.2162695 , -1.10850437, 0.09346817, -0.75449798,
0.43278715, -0.05832841, -1.3821501 , -0.60899813, -0.31587092],
[-0.38180906, -0.52891104, -0.17708021, 0.42817029, 0.47933694,
1.46983879, 1.07611793, 0.21156835, 1.19462086, 0.53346308],
[-0.05955398, -0.96466629, -0.10790754, 0.62358469, -0.12344904,
0.42713814, -0.37758806, 0.72689269, 0.01174638, -0.74131715],
[ 0.14680571, -1.08269597, 0.20227451, 0.07984433, 1.5467686 ,
0.69817302, -0.71883371, -0.04516577, 1.04884854, 1.19944429],
[ 0.92465943, 1.40586636, 1.51558382, -1.00531496, 0.5230326 ,
1.6736601 , 1.27329869, 1.28972216, -0.0720363 , -1.02799967],
[-0.57969921, 0.23813579, 1.21620046, -2.71140871, 0.09631249,
0.84554337, 0.1974113 , -0.3359531 , 1.62908259, 0.21995197],
[-0.60045103, 0.34641445, -2.08536722, 0.48812552, -1.18127033,
0.22964099, 0.62572658, 0.8418479 , 0.46220815, 0.1639371 ],
[ 0.26325721, 1.4150008 , -0.84820039, 0.91181923, -0.87555151,
0.46128701, 1.56097568, -0.01882674, 0.53963081, -0.59584905],
[ 1.49240992, 1.41083169, -0.10465355, 0.5375298 , -0.67638303,
-0.71351734, -0.50615813, -0.08557204, -1.92047539, -0.67569526]])
# Count the numbers in M with an absolute value greater than 1.
print((abs(M)>1).sum())
29
# Reference: https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html
# Input the number of rows (input an integer >= 3)
r = input('Row count: ')
r=int(r)
# Input the number of rows (input another integer >= 3)
c = input('column count: ')
c=int(c)
# Hint: you may want to use the function int().
# Create a r x c array of random integers between 10 (inclusive) and 20 (exclusive).
arr = np.random.randint(9,20,(r,c))
arr
Row count: 4 column count: 4
array([[11, 11, 19, 14],
[11, 13, 17, 17],
[12, 15, 17, 9],
[17, 15, 19, 18]])
# output the element in the first row and the third column
print(arr[0,2])
19
# output the element in the second row and the last column
print(arr[1,3])
17
# output the second row of the array
print(arr[1])
[11 13 17 17]
# output the second last row of the array
print(arr[-1])
[17 15 19 18]
# output the second column of the array
print(arr[:,1])
[11 13 15 15]
# output the second last column of the array
print(arr[:,-1])
[14 17 9 18]
# output the upper-left 3x3 sub-array
print(arr[:3,:3])
[[11 11 19] [11 13 17] [12 15 17]]
# output the bottom-left 3x3 sub-array
print(arr[1:4,:3])
[[11 13 17] [12 15 17] [17 15 19]]
# output the upper-right 3x3 sub-array
print(arr[:3,1:4])
[[11 19 14] [13 17 17] [15 17 9]]
# output the bottom-right 3x3 sub-array
print(arr[1:4,1:4])
[[13 17 17] [15 17 9] [15 19 18]]
# Create a 1-D array from 1 to 50 (inclusive)
A = np.arange(1,51)
A
# Reshape A to a 5x10 array in the following format:
# first row: [1,...,10]
# second row: [11,...,20]
# ...
# fifth row: [41,...,50]
grid = A.reshape(5,10)
print(grid)
[[ 1 2 3 4 5 6 7 8 9 10] [11 12 13 14 15 16 17 18 19 20] [21 22 23 24 25 26 27 28 29 30] [31 32 33 34 35 36 37 38 39 40] [41 42 43 44 45 46 47 48 49 50]]
# add a row [51,...,60] to A
B=np.arange(51,61)
X=np.vstack([grid,B])
X
array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
[21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
[31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
[41, 42, 43, 44, 45, 46, 47, 48, 49, 50],
[51, 52, 53, 54, 55, 56, 57, 58, 59, 60]])
# split the new A (6x10) to two 3x10 arrays A1 and A2
X1,X2=np.vsplit(X,[3])
print(X1)
print(X2)
[[ 1 2 3 4 5 6 7 8 9 10] [11 12 13 14 15 16 17 18 19 20] [21 22 23 24 25 26 27 28 29 30]] [[31 32 33 34 35 36 37 38 39 40] [41 42 43 44 45 46 47 48 49 50] [51 52 53 54 55 56 57 58 59 60]]
# Add 10 to each element of A
print(X+10)
[[11 12 13 14 15 16 17 18 19 20] [21 22 23 24 25 26 27 28 29 30] [31 32 33 34 35 36 37 38 39 40] [41 42 43 44 45 46 47 48 49 50] [51 52 53 54 55 56 57 58 59 60] [61 62 63 64 65 66 67 68 69 70]]
# Calculate the square root of each element in A
print(X**(1/2))
[[1. 1.41421356 1.73205081 2. 2.23606798 2.44948974 2.64575131 2.82842712 3. 3.16227766] [3.31662479 3.46410162 3.60555128 3.74165739 3.87298335 4. 4.12310563 4.24264069 4.35889894 4.47213595] [4.58257569 4.69041576 4.79583152 4.89897949 5. 5.09901951 5.19615242 5.29150262 5.38516481 5.47722558] [5.56776436 5.65685425 5.74456265 5.83095189 5.91607978 6. 6.08276253 6.164414 6.244998 6.32455532] [6.40312424 6.4807407 6.55743852 6.63324958 6.70820393 6.78232998 6.8556546 6.92820323 7. 7.07106781] [7.14142843 7.21110255 7.28010989 7.34846923 7.41619849 7.48331477 7.54983444 7.61577311 7.68114575 7.74596669]]
# Above, we have created a 10x10 array M of normally distributed random values
# with mean 0 and standard deviation 1
M.shape
(10, 10)
# Print the mean, standard devietion of all numbers in M
print(M.mean())
print(M.std())
-0.13033082284572617 0.9715862835045455
# Print the maximumn, 3rd quartile, median, 1st quartile, and minimum in M
print(M.max())
print(np.percentile(M,75))
print(np.median(M))
print(np.percentile(M,25))
print(M.min())
2.9602059790913238 0.4802969311230262 -0.05569680648809157 -0.7645931105828995 -2.5450729227255606